Controllable Lexical Simplification for English

Approaches that fine-tune Transformer-based models have recently shown exciting results on the sentence simplification task. However, so far, no research has applied similar approaches to the Lexical Simplification (LS) task. In this paper, we present ConLS, a Controllable Lexical Simplification system fine-tuned with T5 (a Transformer-based model pre-trained with a BERT-style approach and several other tasks). The evaluation results on three datasets (LexMTurk, BenchLS, and NNSeval) show that our model performs comparably to LSBert (the current state-of-the-art) and even outperforms it in some cases. We also conducted a detailed comparison of the effectiveness of control tokens to give a clear view of how each token contributes to the model.


Introduction
Lexical Simplification (LS) is a Natural Language Processing task that modifies texts by substituting difficult words with easier words (or phrases) while keeping the original information and meaning (Shardlow, 2014). Table 1 shows an example of a lexical simplification. Syntactic Simplification (SS), on the other hand, is a similar task that reduces the syntactic complexity of a text. Both LS and SS can be seen as sub-tasks of the broader task of Automatic Text Simplification (Saggion, 2017), which reduces both the lexical and syntactic complexity of texts. Lexical Simplification systems (Paetzold and Specia, 2017a) usually have components for 1) identification of complex words; 2) generation of substitution words; 3) selection of the substitutes that can fit in the context; 4) ranking substitutes by their simplicity; and 5) morphological and contextual adaptation (if necessary). The systems evaluated in this paper do not perform complex word identification. We use datasets that already have a complex word tagged for each instance. Moreover, we do not address the morphological and contextual adaptation task because neural language models usually return a correctly inflected candidate.

Complex Sentence: The Hush Sound is currently on hiatus.
Simplified Sentence: The Hush Sound is currently on break.
Table 1: A lexical simplification example taken from the LexMTurk dataset (Horn et al., 2014) with the complex word and the substitute word in bold.
The contributions of this paper are:
• To the best of our knowledge, we are the first to introduce a controllable mechanism for LS and to fine-tune a Transformer-based model for LS.
• We have conducted an extensive evaluation with several metrics. This allows us to better understand the system when applied to real-world scenarios.
The rest of the paper is organized as follows: in Section 2, we describe related work on Lexical Simplification, focusing on neural-based systems. Section 3 presents the ConLS approach. Section 4 describes the evaluation metrics and presents the experimental results. Section 5 discusses the results of the experiments, while Section 6 concludes the paper and presents future work.

Related Work
Early Lexical Simplification approaches with unsupervised models used Latent Words Language Models (De Belder and Moens, 2010), Wikipedia-based models/rules (Biran et al., 2011; Yatskar et al., 2010; Horn et al., 2014), and distributional lexical semantics (Glavaš and Štajner, 2015). Paetzold and Specia (2017b) started the use of neural networks for the task, combined with a retrofitted context-aware word embedding model. Qiang et al. (2020, 2021) presented LSBert, a Lexical Simplification system that uses a pretrained BERT (Devlin et al., 2019) model for English to generate substitution candidates. LSBert has two main phases: 1) Substitution Generation with the BERT Masked Language Model, and 2) Substitution Filtering and Ranking with several features: BERT prediction order, a BERT language model, the PPDB database, corpus-based word frequency, and FastText similarity. Martin et al. (2020) presented ACCESS, a controllable Text Simplification system based on Sequence-to-Sequence models. This system allows explicit control of simplification conditions such as length, amount of paraphrasing, lexical complexity, and syntactic complexity. ACCESS achieved SOTA results in Text Simplification benchmarks on the WikiLarge test set. Later on, Martin et al. (2022) introduced MUSS (an extended version) by fine-tuning BART (Lewis et al., 2019) with ACCESS, which improved the results. In addition, Sheang and Saggion (2021) took a similar approach, adding another control token (number of words) and fine-tuning T5 (Raffel et al., 2020).

System Description
Following the recent work of Martin et al. (2020), Martin et al. (2022), Sheang and Saggion (2021), and Štajner et al. (2022b), we apply a similar approach to the lexical simplification task. Specifically, our model is based on Sheang and Saggion (2021), a model originally developed for sentence simplification. We propose a controllable mechanism for LS because we believe that the embedded token values extracted from the training data could give the model additional information about the relations between the source and the target word, so that at inference we could define token values that fulfill our objective, which in this case is to find the best candidates. In the following paragraphs, we describe each token in detail and the reason why it was chosen.
Word Length (WL) is the character length ratio between the complex word and the target word: the number of characters of the target word divided by the number of characters of the complex word. Based on our analysis of the training dataset (TSAR-EN), the complex word is longer than the best candidate 65.71% of the time, shorter than the best candidate 21.30% of the time, and the same length 12.99% of the time.
Word Rank (WR) is the inverse frequency of the target word divided by that of the complex word. The inverse frequency order is extracted from the FastText pre-trained model. Based on our analysis of the TSAR-EN dataset, 85.45% of the time, the complex word has a lower frequency than the best candidate. Therefore, this token is a good indicator to help guide the model to predict simpler candidates.
Candidate Ranking (CR) is the ranking order extracted from the training data. Values are assigned to candidates according to their rank. For example, the best-ranking candidate is given the value 1.00, the second 0.75, the third 0.50, the fourth 0.25, and every candidate from the fifth onward is given 0.00. We used only five different values to avoid overloading the model, as the training data is relatively small. In addition, the rationale behind using these values is that we want the model to learn candidate ranking from the data through the training process rather than by injecting additional information or doing post-processing.
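The three token definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, rounding to two decimals is inferred from the format of the token values, and `inverse_frequency` stands in for whatever lookup is derived from the FastText model (the toy numbers in the usage note are invented for the example).

```python
from typing import Mapping


def word_length_ratio(complex_word: str, target_word: str) -> float:
    """WL: characters in the target word divided by characters in the complex word."""
    return round(len(target_word) / len(complex_word), 2)


def word_rank_ratio(
    complex_word: str,
    target_word: str,
    inverse_frequency: Mapping[str, float],
) -> float:
    """WR: inverse frequency of the target word divided by that of the complex word.

    `inverse_frequency` is a hypothetical word -> value mapping; the paper
    extracts this information from the FastText pre-trained model.
    """
    return round(inverse_frequency[target_word] / inverse_frequency[complex_word], 2)


def candidate_rank_value(position: int) -> float:
    """CR: 1.00, 0.75, 0.50, 0.25 for ranks 1 to 4, and 0.00 from rank 5 onward."""
    return max(0.0, 1.0 - 0.25 * (position - 1))
```

For instance, `word_length_ratio("unprecedented", "unusual")` divides 7 by 13 and rounds to 0.54, and `candidate_rank_value` maps positions 1, 2, 3, 4, 5 to 1.0, 0.75, 0.5, 0.25, 0.0.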

Experiments
In our experiments, we compare our model with the current state-of-the-art model LSBert (Qiang et al., 2020). We used the original LSBert configurations and resources, with the following changes to allow a detailed comparison with our model. By default, LSBert returns only a single best candidate for each complex word, so we changed it to return the 10 best-ranked candidates. We changed the number of BERT mask selections from 10 to 15 so that, after removing duplicate candidates, we still have around 10 candidates. Moreover, we filtered out all candidates that were equal to the complex word. Because all the datasets used have gold-annotated simpler substitutions in all instances, we assume that returning the complex word would be incorrect.

Datasets
This subsection describes all the Lexical Simplification datasets for English that we used in our experiments. We used LexMTurk (Horn et al., 2014), BenchLS (Paetzold and Specia, 2016a), and NNSeval (Paetzold and Specia, 2016b) for testing and the TSAR-EN (Štajner et al., 2022a) dataset for training and validation. LexMTurk has 500 sentences that were obtained from Wikipedia. This dataset contains the marked complex words and their replacements suggested by 50 English-speaking annotators. The BenchLS dataset is a union of the LSeval (De Belder and Moens, 2012) and LexMTurk datasets in which spelling and inflection errors were automatically corrected. The NNSeval dataset is a filtered version of BenchLS adapted to evaluate LS for non-native English speakers.

Sentence: European Union foreign ministers agreed Monday to impose fresh sanctions on Syria as a U.N.-backed peace plan - along with all other diplomatic efforts - has yet to stop the carnage that mounts every day.
The TSAR-EN dataset has 386 instances, each with 25 gold-annotated substitutions. Table 2 shows an example. The instances and their target complex words were extracted from the Complex Word Identification shared task 2018 (Yimam et al., 2018). The instances were annotated using Amazon's Mechanical Turk by 25 annotators. A native English annotator reviewed all suggestions.

Evaluation Metrics
We evaluated the systems with several metrics that take into account the results for different numbers of K candidates (from 1 up to 10). The metrics used are the following:
• Accuracy@1: the ratio of instances with the top-ranked candidate in the gold standard list of annotated candidates.
• Accuracy@K@top1: the ratio of instances where at least one of the top K predicted candidates matches the most frequently suggested synonym(s) in the gold list of annotated candidates.
• Potential@K: the percentage of instances for which at least one of the top K substitutes predicted is present in the set of gold annotations.
• Mean Average Precision@K (MAP@K): This metric evaluates the relevance and ranking of the top K predicted substitutes.
• Precision@K: the percentage of top K generated candidates that are in the gold standard.
• Recall@K: the percentage of gold-standard substitutions that are included in the top K generated substitutions.
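To make the definitions above concrete, the following Python sketch computes Accuracy@K@top1, Potential@K, Precision@K, and Recall@K on toy data. It is an illustration under assumptions, not the evaluation script used in the paper: the function names are hypothetical, the gold annotations are assumed to be given as one list of annotator suggestions per instance (with repetitions), ties for the most frequent suggestion are all kept, and Precision@K and Recall@K are averaged per instance. MAP@K is omitted because the text does not spell out its formula.

```python
from collections import Counter
from typing import List, Set, Tuple


def top1_gold(gold_suggestions: List[str]) -> Set[str]:
    """Most frequently suggested gold substitute(s); ties are all kept."""
    counts = Counter(gold_suggestions)
    best = max(counts.values())
    return {word for word, count in counts.items() if count == best}


def accuracy_at_k_top1(predictions: List[List[str]], gold: List[List[str]], k: int) -> float:
    """Share of instances where a top-K prediction matches a most frequent gold suggestion."""
    hits = sum(1 for p, g in zip(predictions, gold) if set(p[:k]) & top1_gold(g))
    return hits / len(predictions)


def potential_at_k(predictions: List[List[str]], gold: List[List[str]], k: int) -> float:
    """Share of instances where at least one top-K prediction is in the gold set."""
    hits = sum(1 for p, g in zip(predictions, gold) if set(p[:k]) & set(g))
    return hits / len(predictions)


def precision_recall_at_k(
    predictions: List[List[str]], gold: List[List[str]], k: int
) -> Tuple[float, float]:
    """Per-instance Precision@K and Recall@K, averaged over instances (an assumption)."""
    precisions, recalls = [], []
    for p, g in zip(predictions, gold):
        top_k, gold_set = p[:k], set(g)
        matched = sum(1 for candidate in top_k if candidate in gold_set)
        precisions.append(matched / len(top_k) if top_k else 0.0)
        recalls.append(matched / len(gold_set) if gold_set else 0.0)
    n = len(predictions)
    return sum(precisions) / n, sum(recalls) / n


# Toy data, invented for illustration only.
predictions = [["break", "pause", "rest"], ["odd", "rare", "unusual"]]
gold = [["break", "break", "pause", "stop"], ["unusual", "unusual", "new"]]
```

With this toy data, only the first instance has a correct top-ranked candidate, so both Accuracy@1@top1 and Potential@1 are 0.5, while at K=3 both reach 1.0.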

Experimental Setup
In this section, we describe how the data are preprocessed, the training details of the model, and finally, the generation of candidates.

Data Preprocessing
For each instance, we have a sentence, a complex word, and a list of ranked candidates. We compute all the ratios and the ranking, then prepend them to the source sentence. We also use the special tokens [T] and [/T] to mark the boundary of the complex word in the source sentence and of the simple word in the target sentence. Moreover, these special tokens help us identify the candidates during inference. Table 3 shows an example of source and target sentences embedded with token values and boundary tokens.
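The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_pair` is a hypothetical helper, the exact spacing between control tokens is an assumption, and replacing the first occurrence of the complex word is a simplification (a real dataset may identify the word by position).

```python
from typing import Tuple


def build_pair(
    sentence: str,
    complex_word: str,
    substitute: str,
    cr: float,
    wl: float,
    wr: float,
) -> Tuple[str, str]:
    """Build one (source, target) training pair with control and boundary tokens."""
    # Control tokens are prepended to the source sentence only.
    prefix = f"<CR_{cr:.2f}> <WL_{wl:.2f}> <WR_{wr:.2f}>"
    # Assumption: the first occurrence of the complex word is the one to mark.
    source = f"{prefix} " + sentence.replace(complex_word, f"[T]{complex_word}[/T]", 1)
    target = sentence.replace(complex_word, f"[T]{substitute}[/T]", 1)
    return source, target


sentence = (
    "The Obama administration has seen what The New York Times calls an "
    "unprecedented crackdown on leaks of government secrets."
)
source, target = build_pair(sentence, "unprecedented", "unusual", cr=1.00, wl=0.54, wr=0.90)
```

Here `source` starts with `<CR_1.00> <WL_0.54> <WR_0.90>` followed by the sentence with `[T]unprecedented[/T]`, and `target` is the same sentence with `[T]unusual[/T]` in its place.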

Training
For our experiments, we fine-tuned T5-Large on the TSAR-EN dataset. We also compared different T5 models; the results are in Table 6. We split the dataset into 90% for training and 10% for validation. This 10% validation set is also used in the token values search at inference, as described in the following section. We preprocessed the training data by extracting the control tokens and adding them to the source sentence, along with the boundary tokens around the complex word and the substitute word, as shown in Table 3. We set the maximum sequence length (number of tokens) to 128, as all instances in our datasets are shorter than 128 tokens. We used Optuna (Akiba et al., 2019) for the hyper-parameter search. For more details about the implementation and hyper-parameters, please see Appendix A.

Inference
First, we performed a search for the token values that maximize the Accuracy@1@top1 score on the validation set using Optuna (Akiba et al., 2019). We searched values ranging between 0.5 and 1.25, changing the value by 0.05 at each iteration. We searched only WL and WR; for CR, we set the value to 1.00 because we already knew that the best-ranking candidates were given the value 1.00. We then kept these values fixed for all sentences at inference. Finally, at inference, we set the beam size to 15 and the number of returned sequences to 15 so that, after filtering out duplicate candidates, around 10 would remain. The ranking of the candidates follows the order in which the model returns the sequences.
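The search range and the candidate filtering described above can be sketched as below. This is an illustration under assumptions rather than the actual pipeline: the helper names are hypothetical, both endpoints of the 0.5 to 1.25 range are assumed to be included, duplicates are compared case-insensitively, and the hard cap of 10 candidates is a simplification of "around 10". The generation step itself (beam search with 15 returned sequences) is not shown; `generated` stands for its decoded output, in model order.

```python
import re
from typing import List


def value_grid(low: float = 0.5, high: float = 1.25, step: float = 0.05) -> List[float]:
    """Candidate values for WL and WR, from `low` to `high` in steps of `step`."""
    n_steps = round((high - low) / step)
    return [round(low + i * step, 2) for i in range(n_steps + 1)]


# The substitute is whatever the model wrote between the boundary tokens.
CANDIDATE = re.compile(r"\[T\](.*?)\[/T\]")


def extract_candidates(generated: List[str], complex_word: str, limit: int = 10) -> List[str]:
    """Keep the model's order, dropping duplicates and copies of the complex word."""
    seen, candidates = set(), []
    for text in generated:
        match = CANDIDATE.search(text)
        if match is None:
            continue  # the sequence lost its boundary tokens
        candidate = match.group(1).strip()
        key = candidate.lower()
        if not candidate or key == complex_word.lower() or key in seen:
            continue
        seen.add(key)
        candidates.append(candidate)
    return candidates[:limit]


# Toy decoded sequences, invented for illustration only.
generated = [
    "calls an [T]unusual[/T] crackdown",
    "calls an [T]unprecedented[/T] crackdown",
    "calls an [T]Unusual[/T] crackdown",
    "calls an [T]extraordinary[/T] crackdown",
    "calls an odd crackdown",
]
candidates = extract_candidates(generated, "unprecedented")
```

With the toy sequences, the copy of the complex word, the case variant of an earlier candidate, and the sequence without boundary tokens are dropped, leaving `unusual` and `extraordinary` in the model's original order.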

Results and Discussion
In Table 4 we present the results for the metrics: Accuracy@1, Accuracy@k@Top1, and Potential@K.
In Table 5 we present the results for the metrics: MAP@K, Precision@K, and Recall@K. The results of ConLS presented here are based on T5-Large.
Our experiments show that the modified LSBert improved its Accuracy@1 results with respect to the ones reported in the original LSBert paper (Qiang et al., 2021): Accuracy@1 improved from 79.20 to 84.80 for LexMTurk, from 61.60 to 67.59 for BenchLS, and from 43.60 to 44.76 for NNSeval. On the other hand, for the Accuracy@1 metric, the ConLS system does not improve on the results of the modified LSBert system but does improve on the results of the original LSBert for the LexMTurk and BenchLS datasets. The results of the Accuracy@k@Top1 metric show that the modified LSBert achieves better results at K={1,2} and ConLS achieves better results at K={3,4,5} for all datasets. This indicates that when more candidates are allowed (3, 4, and 5 candidates), ConLS generates more instances with candidates within the top-1 gold-annotated substitution(s) than LSBert does. The results of the Potential@K metric show that: 1) in LexMTurk and BenchLS, ConLS outperforms LSBert by an increasing margin from K=3 to K=10; 2) in NNSeval, ConLS improves on the Potential of LSBert only at K=10. For the MAP@K metric, ConLS improves on the results of the modified LSBert at K={4,5,10} in all the datasets. Finally, the results of the Precision@K and Recall@K metrics show the same pattern: 1) for LexMTurk, ConLS outperforms LSBert at all K={3,5,10}; 2) for BenchLS and NNSeval, ConLS outperforms LSBert only at K={5,10}.
We also compared the effect of different T5 models trained with TSAR-EN and evaluated with LexMTurk. Table 6 shows that the T5-Large model performs much better than the T5-Base and T5-Small models in all metrics (Accuracy@1, Accuracy@k@Top1). Therefore, we believe that the performance of our model would improve with a larger model, for example, T5-3B or T5-11B. We tried the T5-3B model, but unfortunately it did not fit into our GPU memory (NVidia RTX 3090) even with the batch size set as small as one.
To evaluate the effectiveness of the control tokens, we conducted further experiments with different combinations of tokens. We trained and evaluated each set of tokens using T5-Large with TSAR-EN for training and LexMTurk for evaluation. The results in Table 7 show that the model trained with no tokens performs worse than the model with all tokens in all metrics; in particular, for the Accuracy@1@Top1 metric, the model with all tokens scores 2 points higher. Moreover, the all-tokens model performs better than all other models in all metrics. This indicates that each token contributes to the selection and ranking of the candidates, which leads to better performance.

Table 4: The results of LSBert and ConLS for the metrics: Accuracy@1, Accuracy@k@Top1, and Potential@K. Columns: Dataset, System, ACC@1, ACC@k@Top1 (@1 to @5), Potential@k (@2, @3, @4, @5, @10).

Conclusions and Future Work
This paper presents ConLS, the first approach for Controllable Lexical Simplification. The paper also describes the evaluation of LSBert and ConLS for English with the LexMTurk, BenchLS, and NNSeval datasets for testing and the TSAR-EN dataset for training. The results of our evaluation show that the modified LSBert improves the Accuracy@1 metric results with respect to the ones seen in the original LSBert paper in all three datasets.

ConLS Tokens   ACC@1   ACC@k@Top1 (@1 @2 @3)
No Tokens      79.20

Limitations
In this section, we describe the limitations of our work. The most probable limiting factors are:
• The size of the training dataset: the TSAR-EN dataset has only 386 instances. Training with datasets that have a larger number of instances would be recommended to create better models.
• Quality of the training dataset: although during the creation of the TSAR-EN dataset, it was inspected and the unsuitable substitutions were removed and replaced with suitable ones (Štajner et al., 2022a), it is possible that the dataset quality could be improved by including substitutions not reported by the annotators.
• Quality of the testing datasets: it is also possible that these datasets could be improved by including substitutions not reported by the annotators.
• Successful adaptation to other languages: it may be difficult to achieve similar adaptations and results in languages other than English because similar resources are less available, specifically for low-resource languages.

A Implementation Details
Our implementation is based on Hugging Face Transformers (Wolf et al., 2020) and PyTorch Lightning. We trained the model using T5-Large for 8 epochs. For the optimization, we used the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 1e-5 and an Adam epsilon of 1e-8. We set the batch size to 8 for both training and testing. For inference, we used beam search with a beam size of 15 to get around 10 candidates after filtering out duplicate candidates and candidates that are the same as the complex word. We trained the model on a machine with an NVidia RTX 3090, an Intel Core i9 CPU, and 32 GB of RAM. The whole process, including training and evaluation on the three datasets, took around 2 hours.

Source: <CR_1.00> <WL_0.54> <WR_0.90> The Obama administration has seen what The New York Times calls an [T]unprecedented[/T] crackdown on leaks of government secrets.
Target: The Obama administration has seen what The New York Times calls an [T]unusual[/T] crackdown on leaks of government secrets.

Table 6: The results of ConLS trained with all tokens using different T5 models. The models were trained with TSAR-EN and evaluated with LexMTurk.

Table 7: The results of ConLS trained with different sets of tokens. Each model was trained with TSAR-EN and evaluated with LexMTurk.