Investigating Text Simplification Evaluation

Modern text simplification (TS) relies heavily on the availability of gold-standard data to build machine learning models. However, existing studies show that parallel TS corpora contain inaccurate simplifications and incorrect alignments. Additionally, evaluation is usually performed by using metrics such as BLEU or SARI to compare system output to the gold standard. A major limitation is that these metrics do not match human judgements, and their performance varies greatly across datasets and linguistic phenomena. Furthermore, our research shows that the test and training subsets of parallel datasets differ significantly. In this work, we investigate existing TS corpora, providing new insights that will motivate the improvement of existing state-of-the-art TS evaluation methods. Our contributions include the analysis of TS corpora based on existing modifications used for simplification and an empirical study on TS model performance using better-distributed datasets. We demonstrate that by improving the distribution of TS datasets, we can build more robust TS models.


Introduction
Text Simplification transforms natural language from a complex to a simple format, with the aim not only of reaching wider audiences (Rello et al., 2013; De Belder and Moens, 2010; Aluisio et al., 2010; Inui et al., 2003) but also of serving as a preprocessing step in related tasks (Shardlow, 2014; Silveira and Branco, 2012).
Simplifications are achieved by using parallel datasets to train sequence-to-sequence text generation algorithms (Nisioi et al., 2017) to make complex sentences easier to understand. These datasets are typically produced by crowdsourcing (Xu et al., 2016; Alva-Manchego et al., 2020a) or by alignment (Cao et al., 2020; Jiang et al., 2020). They are notoriously noisy, and models trained on them give poor results when evaluated by humans (Cooper and Shardlow, 2020). In this paper we add to the growing narrative around the evaluation of natural language generation (van der Lee et al., 2019; Caglayan et al., 2020; Pang, 2019), focusing on parallel text simplification datasets and how they can be improved.
Why do we need to re-evaluate TS resources?
In the last decade, TS research has relied on Wikipedia-based datasets (Zhang and Lapata, 2017; Xu et al., 2016; Jiang et al., 2020), despite their known limitations (Xu et al., 2015; Alva-Manchego et al., 2020a) such as questionable sentence-pair alignments, inaccurate simplifications and a limited variety of simplification modifications. Apart from affecting the reliability of models trained on these datasets, their low quality influences any evaluation that relies on automatic metrics requiring gold-standard simplifications, such as SARI (Xu et al., 2016) and BLEU (Papineni et al., 2001).
Hence, evaluation data resources must be further explored and improved to achieve reliable evaluation scenarios. There is a growing body of evidence (Xu et al., 2015) (including this work) to show that existing datasets do not contain accurate and well-constructed simplifications, significantly impeding the progress of the TS field.
Furthermore, well-known evaluation metrics such as BLEU are not suitable for simplification evaluation. According to previous research (Sulem et al., 2018), BLEU does not significantly correlate with simplicity (Xu et al., 2016), making it inappropriate for TS evaluation. Moreover, it does not correlate (or the correlation is low) with grammaticality and meaning preservation when performing syntactic simplification such as sentence splitting. Therefore, in most recent TS research, BLEU has not been considered a reliable evaluation metric. We use SARI as the preferred method for TS evaluation, which has also been used as the standard evaluation metric in all the corpora analysed in this research.
Our contributions include 1) the analysis of the most common TS corpora based on quantifying modifications used for simplification, evidencing their limitations, and 2) an empirical study on TS model performance using better-distributed datasets. We demonstrate that by improving the distribution of TS datasets, we can build TS models that gain a higher SARI score in our evaluation setting.

Related Work
The exploration of neural networks in TS started with the work of Nisioi et al. (2017), using the largest parallel simplification resource available (Hwang et al., 2015). Neural-based work focused on state-of-the-art deep learning and MT-based methods, such as reinforcement learning (Zhang and Lapata, 2017), adversarial training (Surya et al., 2019), pointer-copy mechanism (Guo et al., 2018), neural semantic encoders (Vu et al., 2018) and transformers supported by paraphrasing rules (Zhao et al., 2018).
Other successful approaches include the usage of control tokens to tune the level of simplification expected (Alva-Manchego et al., 2020a;Scarton and Specia, 2018) and the prediction of operations using parallel corpora (Alva-Manchego et al., 2017;Dong et al., 2020). The neural methods are trained mostly on Wikipedia-based sets, varying in size and improvements in the quality of the alignments. Xu et al. (2015) carried out a systematic study on Wikipedia-based simplification resources, claiming Wikipedia is not a quality resource, based on the observed alignments and the type of simplifications. Alva-Manchego et al. (2020a) proposed a new dataset, performing a detailed analysis including edit distance and proportion of words that are deleted, inserted and reordered, and evaluation metrics performance for their proposed corpus.
Chasing the state-of-the-art is rife in NLP (Hou et al., 2019), and no less so in TS, where a SARI score is too often considered the main quality indicator. However, recent work has shown that these metrics are unreliable (Caglayan et al., 2020) and gains in performance according to them may not deliver improvements in simplification performance when the text is presented to an end user.
We computed the number of changes between the original and simplified sentences through the token edit distance. Traditionally, edit distance quantifies character-level changes from one character string to another (additions, deletions and replacements). In this work, we calculated the token-based edit distance by adapting the Wagner-Fischer algorithm (Wagner and Fischer, 1974) to determine changes at a token level. We preprocessed our sentences by converting them to lowercase prior to this analysis. To make the results comparable across sentences, we divide the number of changes by the length of the original sentence and obtain values between 0% (no changes) and 100% (completely different sentence).
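The token-level measure described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, and whitespace tokenisation is an assumption, since the text only specifies lowercasing.

```python
def token_edit_distance(source_tokens, target_tokens):
    """Wagner-Fischer dynamic programme applied to tokens instead of characters."""
    n = len(target_tokens)
    # previous[j]: distance between the source prefix seen so far
    # and the first j target tokens.
    previous = list(range(n + 1))
    for i in range(1, len(source_tokens) + 1):
        current = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if source_tokens[i - 1] == target_tokens[j - 1] else 1
            current[j] = min(
                previous[j] + 1,         # delete a source token
                current[j - 1] + 1,      # insert a target token
                previous[j - 1] + cost,  # keep or replace a token
            )
        previous = current
    return previous[n]


def change_percentage(complex_sentence, simple_sentence):
    """Edit distance divided by the length of the original sentence, in percent."""
    # Assumption: lowercasing followed by whitespace tokenisation.
    source = complex_sentence.lower().split()
    target = simple_sentence.lower().split()
    if not source:
        return 0.0
    return 100.0 * token_edit_distance(source, target) / len(source)
```

In this sketch the ratio can exceed 100% when the simplified sentence is much longer than the original; the text does not say how such cases were handled.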
In addition to the token-based edit distance experiments, we analysed the difference in sentence length between complex and simple variants, the number of edit operations of each type (INSERT, DELETE and REPLACE), and redundant operations such as deletions and insertions of the same piece of text within the same sentence (we define this as the MOVE operation). Since our objective is to show how different split configurations affect TS model performance, we present the percentage of edit operations, which was the most informative analysis, for the most representative datasets.
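One possible way to count these operation types is sketched below. It is an illustration only: it uses Python's `difflib.SequenceMatcher` rather than the adapted Wagner-Fischer procedure, the function name `operation_counts` is hypothetical, and treating a token that is both deleted and inserted in the same sentence as a MOVE is our reading of the definition above.

```python
from collections import Counter
from difflib import SequenceMatcher


def operation_counts(complex_sentence, simple_sentence):
    """Count INSERT, DELETE, REPLACE and MOVE operations at token level."""
    # Assumption: lowercasing followed by whitespace tokenisation.
    source = complex_sentence.lower().split()
    target = simple_sentence.lower().split()
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)

    deleted, inserted = Counter(), Counter()
    replaced = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            deleted.update(source[i1:i2])
        elif tag == "insert":
            inserted.update(target[j1:j2])
        elif tag == "replace":
            # Count a replaced span by its longer side.
            replaced += max(i2 - i1, j2 - j1)

    # Tokens that were both deleted and inserted are treated as moved.
    moved = deleted & inserted
    return {
        "INSERT": sum((inserted - moved).values()),
        "DELETE": sum((deleted - moved).values()),
        "REPLACE": replaced,
        "MOVE": sum(moved.values()),
    }
```

For example, `operation_counts("the cat quickly ran", "quickly the cat ran")` reports a single MOVE and no other operations under these assumptions.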

Edit Distance Distribution
Except for the recent work of Alva-Manchego et al. (2020b), there has been little work on new TS datasets. Most prior datasets are derived by aligning English and Simple English Wikipedia, for example WikiSmall and WikiLarge (Zhang and Lapata, 2017).
In Figure 1 we can see that the edit distance distribution of the splits in the selected datasets is not even. By comparing the test and development subsets in WikiSmall (Figure 1a) we can see differences in the number of modifications involved in simplification. Moreover, the WikiLarge dataset (Figure 1b) shows a complete divergence of the test subset. Additionally, there is a significant number of unaligned or noisy cases, with between 80% and 100% change, in the WikiLarge training and validation subsets (Figure 1b).
We manually checked a sample of these cases and confirmed they were poor-quality simplifications, including incorrect alignments. The complex/simple pairs were sorted by their edit distances and then manually checked to determine an approximate heuristic for detecting noisy sentences. Since many of these alignments were of very poor quality, it was easy to determine a threshold that removed a significant number of noisy cases without dramatically reducing the size of the dataset.
Datasets such as TurkCorpus (Xu et al., 2015) are widely used for evaluation and their operations mostly consist of lexical simplification (Alva-Manchego et al., 2020a). We can see this behaviour in Figure 1c, where most edits involve a small percentage of the tokens: a large proportion of the sample cases fall between 0% (no change) and 40%.
In the search for better evaluation resources, TurkCorpus was improved with the development of ASSET (Alva-Manchego et al., 2020a), which includes more heterogeneous modifications. As we can see in Figure 1e, the data are more evenly distributed than in Figure 1c.
Recently proposed datasets, such as WikiManual (Jiang et al., 2020), as shown in Figure 1f, have an approximately consistent distribution, and their simplifications are less conservative. Based on a visual inspection of the uppermost values of the distribution (≈80%), we can tell that often most of the information in the original sentence is removed or the target simplification does not accurately express the original meaning.
The MSD dataset (Cao et al., 2020) is a domain-specific dataset, developed for style transfer in the health domain. In the style transfer setting, the simplifications are aggressive (i.e., not limited to individual words), to promote the detection of a difference between one style (expert language) and another (lay language). Figure 1d shows how its change-percentage distribution differs dramatically from those of the other datasets, with most of the results at the right side of the distribution.
Among TS datasets, it is important to mention that the raw text of the Newsela (Xu et al., 2015) dataset was produced by professional writers and is likely of higher quality than other TS datasets. Unfortunately, it is not aligned at the sentence level by default and its usage and distribution are limited by a restrictive data agreement. We have not included this dataset in our analysis due to the restrictive licence under which it is distributed.

KL Divergence
In addition to the edit distance measurements presented in Figure 1, we further analysed the KL divergence (Kullback and Leibler, 1951) of those distributions to understand how much the dataset subsets diverge. Specifically, we compared the distribution of the test set to the development and training sets for WikiSmall, WikiLarge, WikiManual, TurkCorpus and ASSET (when available). We did not include the MSD dataset since it only has a test set.
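As an illustration of this comparison, the sketch below bins the change percentages of two subsets and computes the KL divergence between the resulting histograms. The number of bins, the smoothing constant `epsilon` and the function names are assumptions made for the example; the text does not state how the distributions were discretised.

```python
import math


def histogram(values, bins=10, low=0.0, high=100.0):
    """Count how many change percentages fall into each equal-width bin."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in values:
        index = int((value - low) / width)
        # Clamp out-of-range values (including value == high) to the edge bins.
        index = max(0, min(index, bins - 1))
        counts[index] += 1
    return counts


def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    """KL(P || Q) between two histograms, smoothed so empty bins do not give log(0)."""
    p_total = sum(p_counts) + epsilon * len(p_counts)
    q_total = sum(q_counts) + epsilon * len(q_counts)
    divergence = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = (p_count + epsilon) / p_total
        q = (q_count + epsilon) / q_total
        divergence += p * math.log(p / q)
    return divergence
```

With this sketch, `kl_divergence(histogram(test_values), histogram(train_values))` would compare a test subset against a training subset; the result depends on the chosen binning.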
We performed randomised permutation tests (Morgan, 2006) to confirm the statistical significance of our results. The subsets of each dataset were joined together and split randomly for 100,000 iterations. We then computed the p-value as the proportion of random splits that result in a KL value equal to or higher than the one observed in the data. Based on the p-value, we decide whether the null hypothesis (i.e. that the original splits are truly random) can be retained; we reject it for p-values lower than 0.05. In Table 1 we show the computed KL divergence and p-values. The p-values below 0.05 for WikiManual and WikiLarge indicate that the original splits of these datasets do not follow a truly random distribution.
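The permutation procedure can be sketched as below. The function is generic over the statistic being tested; `mean_gap` is only a stand-in so that the example is self-contained, whereas the study uses the KL divergence between subsets. The names, the seed and the default iteration count are assumptions (the study reports 100,000 iterations).

```python
import random


def permutation_p_value(test_values, train_values, statistic, iterations=1000, seed=0):
    """Proportion of random re-splits whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(test_values, train_values)
    pooled = list(test_values) + list(train_values)
    n_test = len(test_values)
    at_least_as_extreme = 0
    for _ in range(iterations):
        # Join the subsets and draw a new random split of the same sizes.
        rng.shuffle(pooled)
        value = statistic(pooled[:n_test], pooled[n_test:])
        if value >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / iterations


def mean_gap(first, second):
    """Stand-in statistic for the example: absolute difference of the means."""
    return abs(sum(first) / len(first) - sum(second) / len(second))
```

A p-value below 0.05 from such a test would lead to rejecting the hypothesis that the original split is random.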

Simplification Datasets: Experiments
We carried out the following experiments to evaluate the variability in performance of TS models caused by the issues described in Wiki-based data.

Data and Methods
For the proposed experiments, we used the EditNTS model, a Programmer-Interpreter Model (Dong et al., 2020). Although the original code was published, its implementation required minor modifications to run in our setting. The modifications performed, the experimental subsets and the source code are documented via GitHub. We selected the EditNTS model due to its competitive performance on both the WikiSmall and WikiLarge datasets. Hence, we consider this model a suitable candidate for evaluating the different limitations of TS datasets. In future work, we will consider testing our assumptions under additional metrics and models.
In relation to TS datasets, we trained our models on the training and development subsets from WikiLarge and WikiSmall, which are widely used in TS research. In addition, these datasets have a train, development and test set, which is essential for retraining and testing the model with new split configurations. The model was first trained with the original splits, and then with the following variations:
Randomised split: as explained in Section 3.3, the original WikiLarge split does not have an even distribution of edit-distance pairs between subsets. For this experiment, we resampled two of our datasets (WikiSmall and WikiLarge). For each dataset, we joined all subsets together and performed a new random split.
Refined and randomised split: we created subsets that minimise the impact of poor alignments. These alignments were selected by edit distance and then subsets were randomised as above. We presume that the high-distance cases correspond to noisy and misaligned sentences. For both WikiSmall and WikiLarge, we reran our experiments after removing the worst 5% and 2% of alignments.
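The two split variations can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the seed, and the way subset sizes are passed in are hypothetical, and the change percentages are assumed to have been computed beforehand for every complex/simple pair.

```python
import random


def refine_and_split(pairs, change_percentages, drop_fraction, dev_size, test_size, seed=0):
    """Drop the highest-distance pairs, then draw a new random train/dev/test split."""
    # Rank pairs from the smallest to the largest change percentage.
    order = sorted(range(len(pairs)), key=lambda i: change_percentages[i])
    n_drop = int(len(pairs) * drop_fraction)
    # Keep everything except the n_drop pairs with the largest edit distance.
    kept = [pairs[i] for i in order[: len(pairs) - n_drop]]

    rng = random.Random(seed)
    rng.shuffle(kept)
    test = kept[:test_size]
    dev = kept[test_size : test_size + dev_size]
    train = kept[test_size + dev_size :]
    return train, dev, test
```

Setting `drop_fraction=0.0` corresponds to the plain randomised split, while `0.05` and `0.02` correspond to the 95% and 98% refined variants.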
Finally, we evaluated the models using the test subsets of external datasets: TurkCorpus, ASSET and WikiManual. Figure 2 shows the results for WikiSmall. We can see a minor decrease in SARI score with the random splits, which suggests that the noisy alignments were then equally present in all subsets, rather than the best cases being used for training. On the other hand, when the noisy cases are removed from the datasets, the increase in model performance is clear.

Discussion
Likewise, we show WikiLarge results in Figure 3. When the data is randomly distributed, we obtain better performance than with the original splits. This is consistent with WikiLarge having the largest discrepancy according to our KL-divergence measurements, as shown in Section 3.3. We also found that the 95% split behaved similarly to WikiLarge Random. Meanwhile, the 98% dataset gave a similar performance to the original splits for ASSET and TurkCorpus.
We can also note that, although there is a performance difference between WikiSmall Random and WikiSmall 95%, in WikiLarge the same splits have quite similar results. We believe these discrepancies are related to the size and distribution of the training sets. The WikiLarge subset is three times bigger than WikiSmall in the number of simple/complex pairs. Also, WikiLarge has a higher KL divergence (≈0.46) than WikiSmall (≈0.06), which means that WikiLarge could benefit more from a random distribution experiment than WikiSmall, resulting in higher performance on WikiLarge. Further differences may be caused by the procedures used to make the training/test splits in the original research, which were not described in the accompanying publications.
Using randomised permutation testing, we have confirmed that the SARI differences between the models based on the original split and our best alternative (95% refined) are statistically significant (p < 0.05) for each configuration discussed above.
In this study, we have shown the limitations of TS datasets and the variations in performance across different split configurations. However, existing evidence cannot determine which split is the most suitable, especially since this could depend on each specific scenario or target audience (e.g., model data similar to "real world" applications). Also, we have measured our results using SARI, not only because it is the standard evaluation metric in TS but also because there are no better automatic alternatives to measure simplicity. We use SARI as a way to expose and quantify the limitations of SOTA TS datasets. The increase in SARI scores should be interpreted as variability in the relative quality of the output simplifications. By relative we mean that there is a change in simplicity gain, but we cannot state that the simplification is at its best quality, since the metric itself has its own weaknesses.

Conclusions
In this paper, we have shown 1) the statistical limitations of TS datasets, and 2) the relevance of subset distribution for building more robust models. To our knowledge, distribution-based analysis of TS datasets has not been considered before. We hope that the exposure of these limitations starts a discussion in the TS community on whether we are heading in the right direction regarding evaluation resources in TS and more widely in NLG. The creation of new resources is expensive and complex; however, we have shown that current resources can be refined, motivating future studies in the field of TS.