Context-Aware Document Simplification

To date, most work on text simplification has focused on sentence-level inputs. Early attempts at document simplification merely applied these approaches iteratively over the sentences of a document. However, this fails to coherently preserve the discourse structure, leading to suboptimal output quality. Recently, strategies from controllable simplification have been leveraged to achieve state-of-the-art results on document simplification by first generating a document-level plan (a sequence of sentence-level simplification operations) and using this plan to guide sentence-level simplification downstream. However, this is still limited in that the simplification model has no direct access to the local inter-sentence document context, likely having a negative impact on surface realisation. We explore various systems that use document context within the simplification process itself, either by iterating over larger text units or by extending the system architecture to attend over a high-level representation of document context. In doing so, we achieve state-of-the-art performance on the document simplification task, even when not relying on plan-guidance. Further, we investigate the performance and efficiency tradeoffs of system variants and make suggestions of when each should be preferred.


Introduction
Text simplification transforms a given text into a simpler version of itself that can be understood by a wider audience, while preserving the same core meaning (Gooding, 2022). It has also proven useful as a preprocessing step for downstream NLP tasks such as machine translation (Chandrasekar et al., 1996; Mishra et al., 2014; Li and Nenkova, 2015; Štajner and Popovic, 2016) and relation extraction (Miwa et al., 2010; Niklaus et al., 2016).
Most previous work has focused on sentence-level simplification by training neural models on complex/simple sentence pairs under the assumption that they will learn to perform required operations (e.g. sentence splitting, lexical substitution or syntactic rephrasing) implicitly from the training data (Zhang and Lapata, 2017; Nisioi et al., 2017; Jiang et al., 2020). However, the imbalanced representation of simplification operations throughout popular datasets, and the overly-conservative models arising from their use, have led to attempts at controllable simplification to achieve more variation and diversity in output texts (Alva-Manchego et al., 2017; Cripwell et al., 2021; Maddela et al., 2021).
Recently, strategies from controllable simplification have been leveraged to achieve state-of-the-art results on the document simplification task (Cripwell et al., 2023). Specifically, by using a planning model capable of considering the sentences surrounding a complex sentence, a sentence-level simplification model can be guided such that the structure of the resulting document remains more coherent. Despite this success, the sentence simplification model still has no direct access to document context, which we believe limits the extent to which it can accurately produce simplified sentences that are consistent with the larger document.
As such, we propose various systems that allow access to some representation of surrounding content within the simplification module, while still allowing for the possibility of plan-guidance. We show that in doing so, we are able to achieve state-of-the-art document simplification performance on the Newsela dataset, even without relying on a generated plan. Further, we investigate the performance and efficiency tradeoffs of various system variants. Our key contributions are (i) a detailed investigation of how document context, input text and simplification plans impact document-level simplification and (ii) several state-of-the-art models for document simplification. We show in particular that document-level simplification is improved by combining a representation of the local context surrounding complex sentences with a simplification plan indicating how complex sentences should be simplified (whether they should be deleted, rephrased, split or copied).

Related Work
Context in Controlled Text Generation. The use of external context within controlled text generation pipelines has seen recent success in areas outside of simplification. Li et al. (2021) control review generation by using document- and sentence-level plans in the form of knowledge graph subgraphs. Smith et al. (2020) control the style of generated dialog responses by conditioning on a desired style token appended to other contextual utterances. Hazarika et al. (2022) modulate the amount of attention paid to different parts of a dialog context and show that using contextual encoding of question phrases can guide a model to more often generate responses in the form of questions. Slobodkin et al. (2022) consider summarisation where salient spans are first identified before being used to control the generation, while Narayan et al. (2023) first generate a summarisation plan consisting of question-answer pairs.

Simplification Planning. Certain controllable sentence simplification works have approached simplification as a planning problem whereby an operation plan is first generated before being realised downstream to form the simplified text. The first of these are revision-based models that predict a sequence of token-level operations (delete, substitute, etc.), allowing for more control and interpretability (Alva-Manchego et al., 2017; Dong et al., 2019; Kumar et al., 2020; Omelianchuk et al., 2021; Dehghan et al., 2022). Others have taken a sentence-level approach by predicting a high-level operation (sentence split, rephrase, etc.) and using this to condition more typical neural systems (Scarton and Specia, 2018; Scarton et al., 2020; Garbacea et al., 2021; Cripwell et al., 2022).
Recently, the latter approach was leveraged for document simplification, where it obtained state-of-the-art performance (Cripwell et al., 2023). Here, a sequence of sentence-level operations is predicted for an entire document and then used to iteratively condition a sentence-level simplification model.
The system considers both local (token representation of the sentence) and global document context (sequence of sentence-level encodings) when predicting an operation for a given sentence.

Document-Level Simplification. Initial attempts at document simplification simply applied sentence simplification methods iteratively over documents (Woodsend and Lapata, 2011; Alva-Manchego et al., 2019b; Sun et al., 2021). However, it was noted that this alone is insufficient for performing certain operations, often leading to poor discourse coherence in the output (Siddharthan, 2003; Alva-Manchego et al., 2019b).
Various sub-problems of document simplification have been approached in isolation, such as sentence deletion (Zhong et al., 2020; Zhang et al., 2022), insertion (Srikanth and Li, 2021), and reordering (Lin et al., 2021). Sun et al. (2021) took a holistic approach by iteratively applying a sentence-level model, but with additional encoders to embed the two preceding and following sentences, which are used as additional input during generation. However, this was unable to outperform baselines.
Recently, Cripwell et al. (2023) achieved state-of-the-art performance by producing a document simplification framework capable of performing all of the most common operations. Specifically, they use both high-level document context and sentence-level features to generate a plan specifying which operations should be performed on each sentence in a given document, which is then used to condition a sentence simplification model.

Problem Formulation
The goal of text simplification is to generate a text S that simplifies an input text C. In the document-level case, C = c_1 ... c_n is a sequence of complex sentences and S = s_1 ... s_m is a sequence of simple sentences. Cripwell et al. (2023) further decompose this task into a two-stage process wherein a generated plan conditions the simplification:

Ô = argmax_O P(O | C)        Ŝ = argmax_S P(S | C, Ô)

where O = o_1 ... o_n is a simplification plan, i.e. a sequence of sentence-level simplification operations for C (copy, rephrase, split, or delete). The motivation here is that the plan provides a high-level description of how to transform C into S, which can in turn be used to guide the iterative generation of the simplified document across sentences.
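The two-stage decomposition can be sketched as follows. This is a minimal illustration, not the paper's models: `predict_operation` and `simplify` stand in for the trained planner and simplification model, and the word-count rules inside them are invented placeholders.

```python
# Toy sketch of the plan-then-simplify decomposition. The planner assigns one
# operation per complex sentence; the realiser then applies each operation.

def predict_operation(sentence, document):
    # Placeholder planner: delete very short sentences, split very long ones.
    n = len(sentence.split())
    if n < 4:
        return "delete"
    if n > 25:
        return "split"
    return "rephrase"

def simplify(sentence, op):
    # Placeholder realiser: a real model would rewrite or split the sentence.
    if op == "delete":
        return None
    return sentence

def simplify_document(doc_sentences):
    plan = [predict_operation(c, doc_sentences) for c in doc_sentences]
    out = [s for s in (simplify(c, op) for c, op in zip(doc_sentences, plan))
           if s is not None]
    return plan, out
```

In the actual system the plan is predicted with document context and the realiser is a neural model; only the control flow is shown here.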
Although the use of such plans has shown improved results, little attention has been given to how the generation stage itself can be modified to improve document-level simplification.In this work, we investigate whether further changes can be made to simplification models in order to make better use of high-level plans, or alternatively, whether it is possible to forego the planning stage entirely by incorporating high-level document context into the generative model directly.
Terminology and Notations. We use the following terminology and notational conventions:

• p_i is the ith paragraph from the complex document C;
• S = s_1 ... s_m is a ground-truth simplified version of C, containing m sentences;
• Ŝ = ŝ_1 ... ŝ_m is a predicted simplification of C, generated by a simplification model;
• o is a simplification operation with value copy, rephrase, split, or delete;
• Ô = ô_1 ... ô_n is a predicted simplification plan stipulating specific sentence-level operations that should be applied to each c_i ∈ C so as to arrive at some Ŝ;
• Z_i is a high-level representation of the document context for c_i. It is a sequence of vector encodings for a fixed window of sentences surrounding c_i within C.

Data
For all experiments, we use Newsela-auto (Jiang et al., 2020), which is currently the highest-quality document-level simplification dataset available. It consists of 1,130 English news articles from the original Newsela (Xu et al., 2015) dataset, each manually rewritten at five different levels of simplification, corresponding to discrete reading levels (0-4) of increasing simplicity. It also includes both sentence and paragraph alignments for each document pair. As in previous work, for all our models we prepend a control-token to the input specifying the target document reading level.
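The reading-level control mechanism amounts to prefixing the input with a special token. A minimal sketch, where the token format `<RL_k>` is an assumption (the paper does not specify the exact token strings):

```python
# Prepend a reading-level control token to a model input.
# Newsela reading levels range over 0-4; "<RL_k>" is a hypothetical format.

def prepend_reading_level(text, level):
    assert level in range(5), "Newsela reading levels are 0-4"
    return f"<RL_{level}> {text}"
```

In practice such tokens are added to the tokenizer vocabulary so the model learns an embedding for each level during fine-tuning.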
We use the same filtered version of Newsela-auto used in Cripwell et al. (2023), along with the same train/validation/test splits, to allow for model comparison. This also includes plan labels, consisting of an operation (copy, rephrase, split, or delete) assigned to each sentence pair. Statistics of this data can be seen in Table 1.

Models
We distinguish three model categories: (i) models whose sole input is text and which simplify a document either by iterating over its sentences/paragraphs or by handling the entire document as a single input; (ii) models that take both a complex sentence and some representation of its document context as input and simplify a document by iterating over its sentences; and (iii) models that are guided by a plan via control-tokens denoting sentence-level simplification operations prepended to the input sequence. These are illustrated in Table 2 and presented in more detail in the following subsections. Additional training details are outlined in Appendix A.

Text-Only Models
The most basic group of models we test are those that simply take a text sequence as input. We use baseline models trained on entire documents or on individual sentences. We also experiment with paragraph inputs, which we expect to scale better to the document level than isolated sentences: because paragraphs contain a wider token-level representation of local context, they may provide enough information to maintain coherence in the discourse structure of the final document.
BART. We finetune BART (Lewis et al., 2020) to perform simplification at the document (BART_doc), sentence (BART_sent), and paragraph (BART_para) levels. Both BART_sent and BART_para are applied iteratively over a document, with outputs concatenated to form the final simplification result.

Table 2: Different system types and the specific forms of text, context, and plan inputs they consume. C is a complex document, c_i is the ith sentence of C, and p_i is the ith paragraph of C. Ô is a predicted document simplification plan, ô_i is the individual operation predicted for the ith sentence, and ô_{j..j+|p_i|} is the plan extract for a specific paragraph p_i, where j is the index of the first sentence in p_i.
Longformer. Encoder-decoder models like BART often produce worse outputs and become much slower as input documents grow longer. Longformer (Beltagy et al., 2020) aims to overcome these limitations with a modified self-attention mechanism that scales linearly with sequence length. We finetune a Longformer encoder-decoder to perform simplification on documents (LED_doc) and paragraphs (LED_para).

Context-Aware Model (ConBART)
We propose a context-aware modification of the BART architecture (ConBART) that is able to condition its generation on both an input sentence c_i and a high-level representation of its document context Z_i (a sequence of vectors representing surrounding sentences in the document). This is done via extra cross-attention layers in each decoder attention block that specifically attend over Z_i. The ConBART architecture is illustrated in Figure 1. We produce Z_i using the same context representation strategy used for planning in Cripwell et al. (2023). Specifically, the document context is obtained by taking a fixed window of sentences surrounding the target c_i, encoding them with Sentence-BERT (SBERT; Reimers and Gurevych, 2019), and applying custom positional embeddings to represent each sentence's location within the document. (All models are initialised with the pretrained facebook/bart-base model from https://huggingface.co/facebook/bart-base.)
By simplifying the document autoregressively (sentence by sentence), it is also possible to use previously simplified sentences within the left context of the current complex sentence, a method we refer to as dynamic context. In this case, the window of sentences represented within Z_i is defined as

Z_i = (ŝ_{i−r}, ..., ŝ_{i−1}, c_{i+1}, ..., c_{i+r})    (1)

where r is the context window radius and ŝ_i is the simplification output for the ith sentence c_i. We use the same recommended setting of r = 13. The intuition behind the ConBART architecture is that the contextual information should allow the simplification model to implicitly learn useful features of the discourse structure of the document, in a similar way to the planner in Cripwell et al. (2023).
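The dynamic context window of Equation 1 can be computed with simple list slicing. A minimal sketch (sentence encoding with SBERT is omitted; the function returns the raw window):

```python
# Dynamic context window: the left context holds sentences already
# simplified in this pass, the right context holds the remaining
# complex sentences. r is the context window radius.

def dynamic_context(complex_sents, simplified_so_far, i, r):
    left = simplified_so_far[max(0, i - r):i]   # ŝ_{i-r} .. ŝ_{i-1}
    right = complex_sents[i + 1:i + 1 + r]      # c_{i+1} .. c_{i+r}
    return left + right
```

In the real system each sentence in this window would then be embedded with SBERT and given a positional embedding before being passed to ConBART's context cross-attention.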

Plan-Guided Systems
Existing System. We compare with the state-of-the-art system proposed by Cripwell et al. (2023), PG_Dyn, which consists of a standard sentence-level BART model guided by a planner that predicts the simplification operation to be applied to each input sentence, given a left and right context window of sentence representations, Z_i. The planner uses dynamic document context, allowing it to autoregressively update the left-context part of Z_i during planning as each sentence is simplified (see Equation 1).

Pipelines. We construct pipeline systems that consist of each of our proposed models, guided by a document plan generated by the planner from Cripwell et al. (2023) (the same as is used by PG_Dyn).
For this, we use modified versions of each simplification model that are trained to take an operation control-token at the beginning of each text input.
We refer to each of these pipeline systems as Ô → h, where h is the simplification model.We also report results where the ground-truth/oracle plans are used to condition models (O → h).
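The Ô → h pipelines described above reduce to a simple loop: prepend the predicted operation token to each sentence and drop the sentences the plan marks for deletion. A minimal sketch, where the token format `<op>` is an assumption and `simplify_fn` stands in for the trained model h:

```python
# Plan-guided pipeline (Ô → h): each surviving sentence is prefixed with its
# predicted operation token before being passed to the simplification model.

def plan_guided_simplify(complex_sents, plan, simplify_fn):
    out = []
    for c, op in zip(complex_sents, plan):
        if op == "delete":
            continue  # sentences planned for deletion are simply dropped
        out.append(simplify_fn(f"<{op}> {c}"))
    return out
```

Replacing the predicted plan with gold operation labels gives the oracle-plan setting O → h.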
Note that because the planner updates its document context autoregressively at the sentence level, it does not interface perfectly with paragraph-level simplification models. As such, for pipelines using a paragraph-level simplification model (Ô → BART_para, Ô → LED_para), we only update the planner's context after each paragraph has been processed. Thus, for these paragraph-level models, the left context of a complex sentence c_i is only simplified up to the first sentence of the paragraph containing c_i, i.e.

Z_i = (ŝ_{i−r}, ..., ŝ_{j−1}, c_j, ..., c_{i−1}, c_{i+1}, ..., c_{i+r})

where j is the index of the first sentence within the same paragraph as c_i, assuming j > i − r.
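The paragraph-level variant of the context window only differs in its left half: sentences from the start of the current paragraph (index j) onward are still complex. A sketch under that reading:

```python
# Paragraph-level dynamic context: the left context is simplified only up to
# sentence j (the first sentence of the current paragraph); sentences from j
# to i-1 remain complex.

def paragraph_dynamic_context(complex_sents, simplified, i, j, r):
    lo = max(0, i - r)
    left = simplified[lo:j] + complex_sents[max(lo, j):i]
    right = complex_sents[i + 1:i + 1 + r]
    return left + right
```

Setting j = i recovers the fully dynamic sentence-level window.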
We also experimented with multi-task systems that are trained to perform both planning and simplification within a single model, therefore not requiring a pipeline setup. However, this ultimately proved unsuccessful (further details in Appendix B).

Automatic Evaluation
Text simplification is often evaluated on the basis of three criteria: adequacy (or meaning preservation), fluency, and simplicity. For automatic evaluation, we use BARTScore (Yuan et al., 2021) and SMART (Amplayo et al., 2022) as analogs for both adequacy and fluency. Both are reference-based metrics that have previously been used for document simplification as well as other text generation tasks.
For assessing simplicity, we use both the Flesch-Kincaid grade level (FKGL) and SARI (Xu et al., 2016). FKGL is a document-level metric of text readability that has the highest correlation with human judgements (Scialom et al., 2021), while SARI is a simplification metric that has become a staple in the sentence-level simplification literature. We use EASSE (Alva-Manchego et al., 2019a) to calculate both.
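For intuition, FKGL combines average sentence length with average syllables per word. An illustrative implementation (in practice we use EASSE; the syllable counter here is a rough vowel-group heuristic, so scores will only approximate a proper implementation):

```python
import re

# Flesch-Kincaid grade level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def count_syllables(word):
    # Heuristic: count contiguous vowel groups, minimum one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(sentences):
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Lower FKGL indicates simpler text; note the score can go below zero for very short, monosyllabic sentences.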
At test time we generate sequences using beam search with a beam size of 5 and a maximum length of 1024 tokens.

Human Evaluation
Historically, automatic evaluation of long-form text generation has been very difficult to perform (Howcroft et al., 2020; Thomson and Reiter, 2020). As such, we conduct a human evaluation of the proposed systems to more accurately gauge performance.
As full documents are very long and difficult to compare, we conduct evaluations at the paragraph level. For each comparison, a complex paragraph is shown next to an extract from a generated simplification corresponding to that paragraph. Evaluators are then asked to judge whether the generated text (i) is fluent (fluency); (ii) preserves the core meaning of the input (adequacy); and (iii) is simpler to read/understand (simplicity).
Using the test set, we randomly sample 33 complex paragraphs from each non-adjacent reading-level transition pairing, for a total of 198 paragraphs. We take the references and outputs from 4 high-performing systems (PG_Dyn, LED_para, Ô → LED_para, Ô → ConBART) for each (990 outputs in total) and have an annotator rate them on each of the 3 criteria. Because we use a large pool of annotators, we impose a binary answering scheme (yes/no) in order to avoid the inter-annotator subjectivity inherent in a Likert scale. The proportion of positive results is used as the final score for a given system.
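The scoring and significance testing here are both simple enough to sketch directly: each system's score is the proportion of yes judgements, and system pairs are compared with a pooled two-proportion Z-test (normal approximation, no continuity correction):

```python
import math

# Binary-rating aggregation and the two-proportion Z-test used to compare
# systems. Ratings are 0/1 judgements.

def proportion_score(ratings):
    return sum(ratings) / len(ratings)

def two_proportion_z(yes1, n1, yes2, n2):
    p1, p2 = yes1 / n1, yes2 / n2
    p = (yes1 + yes2) / (n1 + n2)  # pooled proportion under H0: p1 == p2
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

The resulting z statistic is compared against the standard normal quantiles for the chosen significance level (e.g. |z| > 1.96 for p < 0.05, two-tailed).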
Further details of the human evaluation are given in Appendix C.

Results and Discussion
Results are shown in Table 3. We also report results for other commonly used metrics in Appendix D.
Context Awareness Matters. Considering all metrics, we find that text-only models that take as input either a sentence (BART_sent) or a whole document (BART_doc, LED_doc) underperform models whose input is more local to the input sentence, either because they work at the paragraph level (LED_para) or because they take both the complex sentence and its local document context as input (ConBART). In other words, models that have access to a local document context (LED_para, ConBART) perform best overall.
LED vs BART. LED models (LED_doc/para) outperform their standard counterparts (BART_doc/para), showing that the modified self-attention is not only more efficient but also more precise than standard self-attention on long inputs.
The Utility of Planning. Plan-guided models (4th horizontal block in Table 3) outperform their standard counterparts on all metrics, showing that a predicted plan has a positive impact on simplification. This is further supported by the fact that models guided by an oracle plan (5th block) achieve even greater performance.
Comparison with the State-of-the-Art (PG_Dyn). Ô → ConBART is similar to PG_Dyn in that, in both cases, a document is simplified by iterating over its sentences and prediction is guided by the local context of the sentence to be simplified. A key difference is that in PG_Dyn this context is exclusively used to predict a simplification operation, while in Ô → ConBART it is additionally used to condition the generation of the simplified sentences. We find that adding this extra control results in significantly better scores than the state-of-the-art PG_Dyn model. This illustrates that document context has utility both for planning (predicting the correct simplification operation) and for realisation (simplifying a given sentence). While Ô → LED_para achieves the best overall results of any system, it is slightly outperformed by O → ConBART when oracle plans are used, suggesting that an improved planner would yield better simplifications with Ô → ConBART than with Ô → LED_para.

Table 4: Human evaluation results for selected simplification systems. The minor group includes examples with a reading-level transition of 2 levels (e.g. 0-2, 1-3, etc.), whereas the major group includes those of 3-4 levels. Each group makes up half of the entire set. Ratings significantly different from the highest score in each column are denoted with * (p < 0.05) and ** (p < 0.01). Significance was determined with two-proportion Z-tests.
Human Evaluation. Results from the human evaluation are shown in Table 4. To better identify where each model excels, we report separate scores for test paragraph pairs with minor (reading-level transition of 2) and major (>2) degrees of simplification, as well as total average scores.
On fluency, all of the systems achieve very high ratings, which is unsurprising given the recognised ability of large language models (LLMs) to produce highly fluent texts. For adequacy, Ô → LED_para achieves the highest overall score, closely followed by LED_para and Ô → ConBART. In terms of simplicity, LED_para and Ô → ConBART equally achieve the highest score. Across all criteria, LED_para achieves the highest average ratings, although very few scores are significantly better than those of other systems.
When considering performance differences between the minor and major simplification groups, we observe some clear trends. Systems that either are not guided by a high-level document plan or lack access to contextual information during generation (PG_Dyn and LED_para) perform notably worse on examples requiring major simplification than they do on minor cases. Conversely, the models with both of these features appear to either perform equally well or even excel on major cases. This suggests potential conservativity in the simplifications performed by PG_Dyn and LED_para.
Another interesting observation is the relatively low ratings given to the references compared to the system outputs. In particular, they receive a much lower adequacy score than any other system on major cases. This could be a result of the systems generating outputs that bear more resemblance to the inputs than those written by humans (see faithfulness BARTScores in Appendix D). For instance, human editors might have been able to confidently delete more content, or refer to information in different paragraphs to which the evaluators were not privy. Despite this, the references still receive fluency ratings competitive with the other systems.
Example Simplifications. Figure 2 shows some example simplification outputs from Ô → LED_para. These are paragraph-level extracts from larger document outputs, which are provided in Appendix E. Due to licensing constraints imposed by Newsela, we use out-of-domain documents from the WikiLarge dataset (Zhang and Lapata, 2017) in these examples.

Model Efficiency
There are various other factors to consider when comparing systems, beyond their raw performance. For instance, the size of the model(s) and how much time and resources are required for inference are important practical considerations when selecting a model for real-world use. As such, we compare each system based on the time taken to simplify the test set and their total parameter counts. Table 5 shows these results.
In our case, any system that uses a plan requires a second model, approximately doubling the number of parameters that must be loaded.These pipeline setups also naturally add to overall inference time.Further, both plan pipelines and ConBART make use of dynamic context, which imposes an autoregressive bottleneck on the simplification of individual documents.
Because of their linearly scaling attention mechanism, the Longformer-based models are the fastest of the proposed systems. Given this and its overall high performance, we recommend LED_para in situations where time or computing resources are at all limited. Alternatively, Ô → ConBART offers a good compromise, providing the high performance of a plan-guided system while mitigating further increases to inference time. This is because it uses the same autoregressive protocol as the planner and can therefore share the generated context representations.
All inference processes were run on a single Nvidia A40 GPU, using a batch size of 16, 32 CPU workers for data loading, and a beam size of 5 for generation.Appendix F provides details on the specific algorithm used to handle dynamic context generation for appropriate models.

Conclusion
We develop a range of document simplification models that are able to use different combinations of text, context, and simplification plans as input, with several models outperforming the previous state-of-the-art both on automatic metrics and according to human judgements. Our results show that a high-level representation of the document can be useful for low-level surface realisation as well as global planning. Further, simplification models with access to local document context, either by working at the paragraph level or by handling an additional input representation, lead to better meaning preservation than those that operate on individual sentences. We conclude by evaluating the efficiency of each system and making recommendations for their selection under different circumstances.

Limitations
Newsela Dataset. One limitation of this study is our use of the Newsela dataset. Because this requires a license to access, researchers cannot fully reproduce our work without first obtaining permission from Newsela Inc. Unfortunately, there is currently no other large dataset offering high-quality aligned documents for simplification under an open-source license. The only other datasets so far used for document-level simplification are based on WikiLarge, which has very poor and inconsistent alignments at the document level (Xu et al., 2015; Sun et al., 2021; Cripwell et al., 2023).
Paragraph-Level Human Evaluation. In order to reduce complexity, our human evaluation was performed on paragraphs rather than full documents. As a result, there is a potential limit to the accuracy of human judgements when certain discourse phenomena are present. For example, important information may be excluded from a specific output paragraph (therefore prompting a low adequacy rating) but actually be present in a different part of the true simplified document.
Monolinguality. This study focused entirely on simplification for English-language documents. Reproducing the proposed systems for other languages would require dedicated datasets of similar scale, along with sentence/paragraph alignments and operation labels (which likely do not currently exist). Further, the nature of simplification in other languages may differ considerably from English with respect to the types of operations performed, potentially reducing the suitability of the proposed framework.

Generalised Target Audience
We approach this study with a definition of "simplification" based on a generalised audience, following the standard set by the assigned reading levels of the Newsela dataset. Existing works often state the intent for their systems to simultaneously assist a wide array of different target users, such as those with cognitive impairments, non-native speakers, and children (Maddela et al., 2021; Garbacea et al., 2021; Sun et al., 2021). However, they rarely go into detail about which simplification strategies work for each of these groups or perform human evaluation with annotators from the same target demographics (Gooding, 2022). As such, we acknowledge that using our systems for a specific demographic might prove insufficient to enable their consumption of media without further revisions to support their precise needs.

A Training Details
For all simplification models, we used a learning rate of 2e-5, a batch size of 16, and a 0.1 dropout rate. All models were trained on a computing grid using 2× Nvidia A40 GPUs (45GB memory) until convergence or a maximum of 48 hours.
For ConBART and the planning pipelines, we use the same settings as Cripwell et al. (2023) for construction of the high-level document context. Specifically, this includes a fixed context window radius of 13 and use of the dynamic context mechanism.

B Multi-Task Systems
We also experimented with models that are explicitly trained to perform both the planning and simplification tasks using the same network. As high-level plans appear to improve the performance of simplification models, we hypothesise that learning both tasks in tandem could benefit overall performance. The motivation for this approach is to produce a model capable of yielding similar or better simplification performance than the pipeline systems, but with a more efficient single-model setup.
Specifically, these models were trained to generate the simplified text prefixed by a predicted plan in the form of operation-specific tokens. This was tested with both ConBART (ConBART_prefix) and a document-level Longformer (LED_prefix). In the case of the Longformer, we also test a variant that generates the plan tokens as sentence separators (LED_sep). Results are shown in Table 6.
Unfortunately, none of these experiments resulted in performance exceeding that of the simplification-only models. Improvement could perhaps be reached with the correct tuning of hyperparameters and loss weightings; however, we did not have the time or resources to pursue this further in this study.

C Human Evaluation Details
The Newsela-auto paragraph alignments were used to identify valid references for each test paragraph. In order to align correct extracts from generated system outputs, we took different steps depending on the system. For paragraph-level models (those using LED_para), we simply use the full simplification output for each source paragraph. For sentence-level models (ConBART, PG_Dyn), we first used the alignments to identify which paragraph each source sentence belongs to, then concatenated their simplification results.
Human judgements were crowdsourced on the MTurk platform. We sourced workers from English-speaking countries (AU, CA, GB, IE, NZ, US) and paid them $0.20 USD for each individual evaluation. We ran an initial test ourselves and timed how many evaluations could be completed within an hour. Based on this, workers would earn approximately $18 USD per hour (above the minimum wage in all of these countries). The form and instructions presented to human evaluators are shown in Figure 3.

D Additional Evaluation Results
In Table 7 we provide additional results for popular automatic evaluation metrics that were not included in the main text. Specifically, we include BLEU (Papineni et al., 2002) and full operation-specific scores for SARI. In general, the results are similar to those in Table 3, with Ô → LED_para and Ô → ConBART achieving the best results.
Faithfulness BARTScore is included for clarity rather than as a direct estimate of output quality. It shows how semantically similar system outputs are to their inputs, roughly equating to a measurement of conservativity.

E Example Simplifications
Figure 4 shows several example simplifications by the Ô → LED_para system on full documents. Due to licensing constraints imposed by Newsela, we use out-of-domain documents from the WikiLarge dataset here. As these are Wikipedia articles, they are quite different in tone from the Newsela articles, as well as much shorter in length. Regardless, we believe this still provides clarity on the types of editing performed by the model.

F Dynamic Context Algorithm
Algorithm 1 shows the process used to handle dynamic context generation for the appropriate models. As each document needs to be simplified autoregressively at the sentence level, we construct batches of sentences with the same index from different documents in order to speed up processing. Note that this could potentially be further optimised (e.g. via parallelism) and merely serves as a reasonable baseline algorithm.
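The batching scheme can be sketched as follows. This is an illustrative reconstruction, not Algorithm 1 verbatim: `h` stands in for the simplification model (applied per item here, where the real implementation would make one batched model call), and the context is the dynamic window of Equation 1.

```python
# Batched dynamic-context simplification: sentences sharing the same index
# across documents are processed together, and each document's left context
# is updated with its new output after every step.

def batched_dynamic_simplify(docs, h, r=13):
    outputs = [[] for _ in docs]
    max_len = max(len(d) for d in docs)
    for i in range(max_len):
        batch, owners = [], []
        for d, (doc, out) in enumerate(zip(docs, outputs)):
            if i < len(doc):
                left = out[max(0, i - r):i]      # already-simplified sentences
                right = doc[i + 1:i + 1 + r]     # remaining complex sentences
                batch.append((doc[i], left + right))
                owners.append(d)
        preds = [h(sent, ctx) for sent, ctx in batch]  # one batched call in practice
        for d, p in zip(owners, preds):
            outputs[d].append(p)
    return outputs
```

Shorter documents simply drop out of later batches, so throughput degrades gracefully as documents finish.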

Figure 1: ConBART model architecture. The added context attention layer is shown in yellow, which allows for cross-attention over the high-level document context, Z_i.

Figure 2: Example WikiLarge simplification output extracts from Ô → LED_para (a target reading level of 3 was used in each case). Note that these are small extracts from larger documents shown in Appendix E. Deletions are underlined and in red; rephrasings are italicised and in green; splitting points are highlighted in cyan; and factual errors are circled.

Figure 4: Example simplification outputs for Ô → LED_para, illustrating both strong and poor performances (a target reading level of 3 was used for all examples). Input documents are taken from WikiLarge due to licensing constraints around sharing Newsela content. Deletions are underlined and in red; rephrasings are italicised and in green; splitting points are highlighted in cyan; and factual errors are circled.

Table 3: Results of document simplification systems on Newsela-auto. For BARTScore, h is the hypothesis and r is the reference. Scores significantly higher than PG_Dyn are denoted with * (p < 0.005). Significance was determined with Student's t-tests.

Table 5: Model efficiency statistics. All times are in milliseconds and model parameters are in millions. Inference times are calculated on the test set and normalised by the total number of sentences (i.e. # ms per sentence).

Table 6: Results of multi-task systems on the Newsela-auto test set.

Figure 3: Submission form used in human evaluation.

Table 7: Extra automatic evaluation results on Newsela-auto. For BARTScore, s is the source text and h is the hypothesis. Scores significantly higher than PG_Dyn are denoted with * (p < 0.005). Significance was determined with Student's t-tests.