Simplicity Level Estimate (SLE): A Learned Reference-Less Metric for Sentence Simplification

Automatic evaluation for sentence simplification remains a challenging problem. Most popular evaluation metrics require multiple high-quality references -- something not readily available for simplification -- which makes it difficult to test performance on unseen domains. Furthermore, most existing metrics conflate simplicity with correlated attributes such as fluency or meaning preservation. We propose a new learned evaluation metric (SLE) which focuses on simplicity, outperforming almost all existing metrics in terms of correlation with human judgements.


Introduction
Text simplification involves rewriting a text to make it easier to read and understand by a wider audience, while still expressing the same core meaning. This has potential benefits for disadvantaged end-users (Gooding, 2022), while also showing promise as a preprocessing step for downstream NLP tasks (Miwa et al., 2010; Mishra et al., 2014; Štajner and Popovic, 2016; Niklaus et al., 2016). Although some recent work considers the simplification of entire documents (Sun et al., 2021; Cripwell et al., 2023a,b), the majority of work focuses on individual sentences, given the lack of high-quality resources (Nisioi et al., 2017; Martin et al., 2020, 2021).
A major limitation in evaluating sentence simplification is that most popular metrics require high-quality references, which are rare and expensive to produce. This also makes it difficult to assess models on new domains where labeled data is unavailable. Another limitation is that many metrics evaluate simplification quality by combining multiple criteria (fluency, adequacy, simplicity), which makes it difficult to determine where exactly systems succeed and fail; as these criteria are often highly correlated, high scores could be spurious indications of simplicity (Scialom et al., 2021b). Table 1 describes how popular metrics conform to various desirability standards.
We propose SLE (Simplicity Level Estimate), a learned reference-less metric that is trained to estimate the simplicity of a sentence. Unlike reference-based metrics (which estimate simplicity with respect to a reference), SLE can be used as an absolute measure of simplicity, as a relative measure of simplicity gain compared to the input, or to measure error with respect to a target simplicity level. In this short paper, we focus on simplicity gain with respect to the input and show that SLE is highly correlated with human judgements of simplicity, competitive with the best performing reference-based metric. We also show that, when controlling for meaning preservation and fluency, many existing metrics used to assess simplifications do not correlate well with human ratings of simplicity.

A Metric for Simplicity
The SLE Metric. We propose SLE, a learned metric which predicts a real-valued simplicity level for a given sentence without the need for references. Given some sentence t, the system predicts a score SLE(t) ∈ R, with higher values indicating higher simplicity. This can not only be used as an absolute measure of simplicity for a system output ŷ, but also to measure the simplicity gain relative to the input x:

∆SLE(ŷ, x) = SLE(ŷ) − SLE(x)   (1)

In this paper we primarily focus on ∆SLE, as it is the most applicable variant under common sentence simplification standards.
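As a concrete illustration, the score compositions above can be sketched in a few lines of Python. The `sle` function here is a hypothetical stand-in for the learned model (a toy length heuristic, not the actual metric):

```python
def sle(sentence: str) -> float:
    """Hypothetical stand-in for the learned estimator SLE(t).

    Toy proxy for illustration only: shorter sentences score as simpler.
    The real metric is a fine-tuned RoBERTa regressor.
    """
    return max(0.0, 4.0 - len(sentence.split()) / 8.0)

def delta_sle(output: str, source: str) -> float:
    """Simplicity gain of a system output over its input: SLE(output) - SLE(source)."""
    return sle(output) - sle(source)

src = "The committee promulgated regulations mandating the cessation of operations."
out = "The committee made rules to stop the work."
gain = delta_sle(out, src)  # positive when the output is simpler than the input
```

A positive ∆SLE indicates the output is estimated to be simpler than its input; zero or negative values flag conservative or complicating edits.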
Model. As the basis for the metric, we fine-tune a pretrained RoBERTa model to perform regression over simplicity levels given sentence inputs, using a batch size of 32 and a learning rate of 1e-5. We ran training experiments on a computing grid with an Nvidia A40 GPU.
Data. We use Newsela (Xu et al., 2015), which consists of 1,130 news articles manually rewritten at five discrete reading levels (0-4), each increasing in simplicity. Existing works often assume sentences have the same reading level as the document they are from (Lee and Vajjala, 2022; Yanamoto et al., 2022); however, we expect there to be a lot of variation in the simplicity of sentences within documents and overlap across levels. As such, merely training to minimize error with respect to these labels would likely result in mode collapse within levels (a peaky, low-entropy distribution) and strong overfitting to the Newsela corpus. To address this mismatch between document- and sentence-level simplicity, we take the following two mitigating steps to allow the model to better differentiate between sentences from the same reading level.
Label Softening. We attempt to mitigate peakiness in the output distribution by softening the quantized reading levels assigned to each sentence in the training data. Specifically, we interpolate regression labels throughout overlapping class regions (±1) according to their Flesch-Kincaid grade level (FKGL) (Kincaid et al., 1975). FKGL is a readability metric often used in education as a means to judge the suitability of books for students (higher values indicate higher complexity).
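For reference, FKGL is a simple linear function of average sentence length and syllables per word. A minimal single-sentence implementation might look like the following, using a naive vowel-group syllable counter (real implementations typically rely on a pronunciation dictionary):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups; floor at one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(sentence: str) -> float:
    """Flesch-Kincaid grade level of a single sentence (Kincaid et al., 1975):
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z]+", sentence)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59
```

Short, monosyllabic sentences land at or below early grade levels, while long polysyllabic ones score much higher, which is exactly the gradient exploited for intra-level ranking below.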
If L is the set of sentences belonging to some reading level, we define an intra-level ranking according to re-scaled, negative FKGLs:

f_{L,i} = 2 · (g_i − min_{j∈L} g_j) / (max_{j∈L} g_j − min_{j∈L} g_j),  where g_j = −FKGL(x_j)   (2)

where f_{L,i} is the revised FKGL score of sentence x_i. Intuitively, this inverts FKGL scores (so that higher values mean higher simplicity) and rescales them to lie in [0, 2]. The [0, 2] scaling ensures that the distribution of final scores in each reading level spreads roughly ±1 around its mean and overlaps with adjacent groups (see Figure 1 for a visual representation).
From this, we derive the final revised labels:

l̃_{L,i} = l_{L,i} + (f_{L,i} − f̄_L)   (3)

where f̄_L is the mean of the f_{L,i} within L, l_{L,i} is the reading level for the ith sentence of L, and l̃_{L,i} is its revised soft version. For example, if the original document has a reading level of 3, and one of its sentences has a revised FKGL score (Equation 2) of 1.5 while f̄_L = 1, then the softened label for that sentence will be 3.5 (Equation 3). Figure 1 shows the distributional differences between the original reading levels and the resulting softened versions for the training data.
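The softening procedure can be sketched end-to-end as follows. This is our reading of the description above (min-max rescaling of negated FKGL to [0, 2] within each level, then re-centring on the document level); edge cases such as a level whose sentences all share one FKGL are not handled:

```python
def soften_labels(levels_to_fkgl):
    """Soften quantized reading-level labels.

    levels_to_fkgl: dict mapping reading level -> list of FKGL scores,
    one per sentence in that level. Returns a parallel dict of soft labels.
    """
    soft = {}
    for level, fkgls in levels_to_fkgl.items():
        neg = [-v for v in fkgls]                            # invert: higher = simpler
        lo, hi = min(neg), max(neg)
        f = [2 * (g - lo) / (hi - lo) for g in neg]          # rescale to [0, 2]
        mean_f = sum(f) / len(f)
        soft[level] = [level + (fi - mean_f) for fi in f]    # centre on the level
    return soft
```

For a level-3 group with FKGLs [1.0, 2.0, 3.0], the soft labels average back to 3.0, with the easiest sentence (lowest FKGL) pushed up towards 4 and the hardest down towards 2, producing the overlap between adjacent levels shown in Figure 1.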
We report results for a model using softened labels (SLE) as well as a variant using the original quantized labels (SLE_Z).

Document-Level Optimization. Given that Newsela reading levels are assigned at the document level, the labels of individual sentences are likely noisy, but should approach the document label on average. We therefore perform early stopping with respect to the document-level validation MAE (mean absolute error) and use a train/dev/test split (90/5/5) that keeps sentences from all versions of a given article together. The size of each data split is given in Table 2.

Similarity Metrics
We compare SLE with four reference-based and two reference-less metrics previously used to assess the output of simplification models. Table 1 summarizes their main features.

SARI.
The most commonly used evaluation metric is SARI (Xu et al., 2016), which compares n-gram edits between the output, input and references. Despite its widespread usage, SARI has known limitations. The small set of operations it considers makes it much more focused towards lexical simplification, showing very low correlations with human ratings in cases where structural changes (e.g. sentence splitting) have occurred (Sulem et al., 2018). As it is token-based, it is entirely reliant on the references, without any robustness to synonymy.
BERTScore. Zhang et al. (2019) present BERTScore, which overcomes some of these shortcomings through its use of embeddings to compute similarities. It has been found to correlate highly with human ratings of simplicity (Alva-Manchego et al., 2021), but still requires references. It is reportedly worse than SARI at differentiating conservative edits (Maddela et al., 2022) and its high correlation with simplicity ratings may be spurious (Scialom et al., 2021b).

FKGL. Unlike most other metrics, FKGL does not explicitly consider the adequacy and fluency dimensions: it is reference-less and assumes the text being evaluated is already well-formed (Xu et al., 2016).
QUESTEVAL. Scialom et al. (2021a) propose QUESTEVAL, a reference-less metric that compares two texts by generating and answering questions between them. Although originally intended for summarization, it has shown some promise as a potential meaning preservation metric for simplification (Scialom et al., 2021b).

Evaluation
We evaluate SLE both in terms of its ability to perform the regression task and how well it correlates with human judgements of simplicity. For the latter we consider ∆SLE, as this conforms with what human evaluators were asked when giving ratings (to measure simplicity gain over the input).
Regression. To evaluate regression models we consider (i) the MAE with respect to the original quantized reading levels, (ii) the document-level error when averaging all sentence estimates from a given document (Doc-MAE), and (iii) the F1 score obtained by rounding estimates and treating the task as classification. We expect the best model for our purposes to achieve a lower Doc-MAE, as it should better approximate true document-level simplicity labels in aggregate.
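The three diagnostics above are straightforward to compute. A sketch, assuming each sentence carries its document's label and a document identifier:

```python
from collections import defaultdict

def mae(preds, golds):
    """Sentence-level mean absolute error against quantized labels."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(preds)

def doc_mae(preds, golds, doc_ids):
    """Average sentence estimates per document, then compare to the document label."""
    by_doc = defaultdict(list)
    for p, g, d in zip(preds, golds, doc_ids):
        by_doc[d].append((p, g))
    errors = []
    for pairs in by_doc.values():
        avg_pred = sum(p for p, _ in pairs) / len(pairs)
        doc_gold = pairs[0][1]  # all sentences in a doc share its reading level
        errors.append(abs(avg_pred - doc_gold))
    return sum(errors) / len(errors)

def rounded_labels(preds):
    """Round regression estimates to the nearest level for F1 scoring."""
    return [round(p) for p in preds]
```

Note that a model can have a worse sentence-level MAE yet a better Doc-MAE: symmetric deviations around the document label cancel out in the per-document average, which is exactly the behaviour expected of well-calibrated soft labels.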
Correlation with Human Simplicity Judgments.
We test the effectiveness of the metric by comparing its correlation with two datasets of human simplicity ratings: Simplicity-DA (Alva-Manchego et al., 2021) and Human-Likert (Scialom et al., 2021b). Simplicity-DA contains 600 system outputs, each with 15 ratings and 22 references, whereas Human-Likert contains 100 human-written sentence simplifications, each with ∼60 simplicity ratings and 10 references. We use all references when computing the reference-based metrics and consider the average human simplicity rating for each item.
As Simplicity-DA consists of system output simplifications, it naturally contains some sentences that are not fluent or semantically adequate. In such cases, humans would likely give low scores on the simplicity dimension as well (e.g. it is not simple to understand non-fluent text); this is reflected in the inter-correlation between simplicity and the two other dimensions (Pearson's r of 0.771 for fluency and 0.758 for adequacy). Thus, we only consider a filtered subset of Simplicity-DA containing those system outputs with both human fluency and meaning preservation ratings at least 0.3 standard deviations above the mean (top ∼30%), which allows us to more appropriately consider how well metrics identify simplicity alone. For Human-Likert, the inter-correlations with fluency and meaning preservation are less pronounced, but do still exist (0.736 and 0.370). As such, a metric with high correlation on Human-Likert but low correlation on the filtered Simplicity-DA subset likely measures one of the other aspects rather than simplicity itself.
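The filtering step can be sketched as a z-score threshold over both auxiliary dimensions (the field names here are illustrative, not the dataset's actual schema):

```python
from statistics import mean, stdev

def filter_by_quality(items, threshold=0.3):
    """Keep items whose fluency AND adequacy ratings are both at least
    `threshold` standard deviations above the corpus mean."""
    def zscores(values):
        m, s = mean(values), stdev(values)
        return [(v - m) / s for v in values]

    flu_z = zscores([it["fluency"] for it in items])
    ade_z = zscores([it["adequacy"] for it in items])
    return [it for it, f, a in zip(items, flu_z, ade_z)
            if f >= threshold and a >= threshold]
```

Requiring both dimensions to clear the threshold is what concentrates the subset on outputs whose simplicity ratings cannot be explained away by disfluency or meaning loss.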

Results
Results on the regression task can be seen in Table 3. We see that although using soft labels obviously worsens MAE with respect to the original reading levels, the document-level MAE is improved, suggesting that quantized labels lead to more extreme errors under uncertainty, as scores are drawn towards integer values. When treated as a classification task (by rounding predictions), both systems show similar performance (F1). This shows that SLE is better able to approximate document-level simplicity ratings on average, with little to no drawback at the sentence level (assuming the quantized labels were correct).
Correlations with human ratings of simplicity are shown in Table 4. The best metric on Human-Likert is LENS, closely followed by ∆SLE, with other metrics lagging quite far behind. This clearly shows the effectiveness of ∆SLE, as it outperforms all existing metrics but LENS without requiring any references, and uses a smaller network architecture than LENS and BERTScore. On Simplicity-DA, metrics follow a similar rank order, except that certain metrics drop substantially (SARI, BERTScore, BLEU). As Human-Likert still has moderate inter-correlation between evaluation dimensions, the large drops in performance can likely be attributed to these metrics mostly measuring semantic similarity with references rather than actual simplicity. Accounting for the inter-correlation between dimensions has less impact on metrics like ∆SLE and FKGL, confirming the validity of readability-based metrics as potential measures of pure simplicity.
Related Work

Štajner et al. (2014) attempt to assess each quality dimension of simplifications by training classifiers with two (good, bad) or three (good, medium, bad) classes, using existing evaluation metrics as features. However, when the simplicity dimension is considered, performance was poor (Štajner et al., 2016). Later, Martin et al. (2018) were able to slightly improve on this by exploring a wider range of features. However, these works do not predict real-valued estimates of simplicity, nor have they been adopted as evaluation metrics.
Some studies from the automatic readability assessment (ARA) literature use quantized Newsela reading levels as labels to train regression models. Lee and Vajjala (2022) do so to predict the readability of full documents, which does not extend to sentence simplification. Yanamoto et al. (2022) use reading-level accuracy within an RL reward for sentence simplification, but do so using the reading levels that were assigned to each document. This too does not transfer well to sentence-level evaluation, given the imprecision and noise introduced by using quantized ratings that were assigned at the document level. These approaches have not been applied to the actual evaluation of sentence simplification systems.

Future Directions
In this paper we explore the efficacy of SLE as a measure of raw simplicity or relative simplicity gain (∆SLE). However, given the flexibility of not relying on references, SLE can potentially be used in other ways. For example, one could measure an error with respect to a target simplicity level l*:

∆SLE_{l*}(ŷ) = |l* − SLE(ŷ)|   (4)

This could be useful in the evaluation of controllable simplification systems, which should be able to satisfy the simplification requirements of specific user groups or reading levels (Martin et al., 2020; Cripwell et al., 2022; Yanamoto et al., 2022). As SLE is trained with aggregate document-level accuracy in mind, it could also be used to evaluate document simplification, either by averaging sentence scores or via some other aggregation method.
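A minimal sketch of these two extensions, assuming an absolute-deviation form for the target-level error and simple averaging for document aggregation (both are natural instantiations, not necessarily the exact formulations a deployed metric would use):

```python
def sle_target_error(sle_score: float, target_level: float) -> float:
    """Deviation of a predicted simplicity score from a target level l*.

    Assumes |l* - SLE(y)|; controllable systems should drive this towards 0.
    """
    return abs(sle_score - target_level)

def document_sle(sentence_scores: list) -> float:
    """Aggregate sentence-level SLE estimates into a document score by averaging."""
    return sum(sentence_scores) / len(sentence_scores)
```

Other aggregations (e.g. length-weighted means, or penalizing the hardest sentence in a document) are equally plausible and left open here, as in the text.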

Conclusion
In this paper we presented SLE, a reference-less evaluation metric for sentence simplification that is competitive with or better than the best performing reference-based metrics in terms of correlation with human judgements of simplicity. We reconsidered the ability of popular metrics to accurately gauge simplicity when controlling for other factors such as fluency and semantic adequacy, confirming suspicions that many do not measure simplicity directly. We hope this work motivates further investigation into the efficacy of standard simplification evaluation techniques and the proposal of new methodologies.

Figure 1: Distribution of (a) original quantized and (b) softened labels for sentences in the SLE training data.

Table 1: Desirable attributes of popular simplification evaluation metrics: whether they are designed with simplification in mind, use semantic representations, or do not require references.

Table 2: Number of sentences sourced from documents of each quantized reading level.

Table 3: Accuracy results for reading level estimators. Errors are calculated according to the original quantized reading level labels.