MANTIS at TSAR-2022 Shared Task: Improved Unsupervised Lexical Simplification with Pretrained Encoders

In this paper we present our contribution to the TSAR-2022 Shared Task on Lexical Simplification of the EMNLP 2022 Workshop on Text Simplification, Accessibility, and Readability. Our approach builds on and extends the unsupervised lexical simplification system with pretrained encoders (LSBert) introduced by Qiang et al. (2020) in the following ways: for the subtask of simplification candidate selection, it utilizes a RoBERTa transformer language model and expands the size of the generated candidate list. For subsequent substitution ranking, it introduces a new feature weighting scheme and adopts a candidate filtering method based on textual entailment to maximize semantic similarity between the target word and its simplification. Our best-performing system improves LSBert by 5.9% in accuracy and achieves second place out of 33 ranked solutions.


Introduction
Lexical simplification (LS) is a natural language processing (NLP) task that involves automatically reducing the lexical complexity of a given text while retaining its original meaning (Shardlow, 2014; Paetzold and Specia, 2017b). Since LS has a high potential for social benefit and improving social inclusion for many people, it has attracted increasing attention in the NLP community (Štajner, 2021). LS systems are commonly framed as a pipeline of three main steps (Paetzold and Specia, 2017a): (1) Complex Word Identification (CWI), (2) Substitute Generation (SG), and (3) Substitute Ranking (SR), with CWI often being treated as an independent task.
In this paper, we present our contribution to the English track of the TSAR-2022 Shared Task on LS (Saggion et al., 2022). Focusing on steps (2) and (3) of the pipeline above, the task was defined as follows: Given a sentence containing a complex word, systems should return an ordered list of "simpler" valid substitutes for the complex word in its original context. The list of simpler words (up to a maximum of 10) returned by the system should be ordered by the confidence the system has in its predictions (best predictions first), and must not contain ties. The task employed a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese. The gold annotations consist of all simpler substitutes suggested by crowdsourced workers and checked for quality by at least one computational linguist who is a native speaker of the respective language (for details, see Štajner et al. (2022)). Contributing teams were provided with a small sample of gold-standard annotations as a trial dataset. For English, this trial dataset consists of 10 instances, each comprising a sentence, a target complex word, and a list of substitution candidates. The English test dataset consists of 373 sentence/complex-word pairs. Submissions were evaluated in terms of ten performance metrics that fall into three groups: (1) MAP@K (Mean Average Precision@K) for K = 1, 3, 5, 10, which evaluates a ranked list of predicted substitutes, containing matched (relevant) and unmatched (irrelevant) terms, against the set of gold-standard annotations. (2) Potential@K for K = 1, 3, 5, 10, which quantifies the percentage of instances for which at least one of the predicted substitutions is present in the set of gold annotations. (3) Accuracy@K@top1 for K = 1, 2, 3, which represents the ratio of instances where at least one of the K top predicted candidates matches the most frequently suggested synonym(s) in the gold list of annotated candidates.
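To make these definitions concrete, the Potential@K and Accuracy@K@top1 computations can be sketched as follows (a minimal Python illustration; function names and data layout are our own, not those of the official evaluation script):

```python
def potential_at_k(predictions, gold, k):
    """Fraction of instances where at least one of the top-k predicted
    substitutes appears in the set of gold annotations."""
    hits = sum(1 for preds, g in zip(predictions, gold)
               if any(p in g for p in preds[:k]))
    return hits / len(predictions)

def accuracy_at_k_top1(predictions, gold_top1, k):
    """Fraction of instances where at least one of the top-k predictions
    matches the most frequently suggested gold substitute(s)."""
    hits = sum(1 for preds, top in zip(predictions, gold_top1)
               if any(p in top for p in preds[:k]))
    return hits / len(predictions)
```

Here `predictions` is a list of ranked candidate lists and `gold` a list of gold-annotation sets; MAP@K additionally averages position-sensitive precision over the ranked list.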

System Description
Our contribution to the TSAR shared task builds on and extends the approach to unsupervised lexical simplification with pretrained encoders (LSBert) described in Qiang et al. (2020, 2021). This approach leverages pretrained transformer language models to generate context-aware simplifications for complex words. The LSBert simplification algorithm addresses two of the three principal subtasks of LS: simplification candidate generation and substitution ranking.
Our approach extends LSBert in the following ways: (1) It utilizes a RoBERTa transformer language model for simplification candidate generation and expands the size of the generated candidate list. (2) It introduces new substitution ranking methods that involve (i) a re-weighting of the ranking features used by LSBert and (ii) the adoption of equivalence scores based on textual entailment to maximize semantic similarity between the target word and its simplification. In submissions (runs) 2 and 3, we further explore the utility of crowdsourcing- and corpus-based measures of word prevalence for substitution ranking. The simplification algorithm underlying the three submissions described in this paper is shown in Algorithm 1. In the following, we describe the details of simplification candidate generation (2.1), substitution ranking (2.2), and obtaining equivalence scores (2.3).

[Algorithm 1 (final steps): for each feature, all_ranks ← all_ranks ∪ rank; tot_rank ← sum(all_ranks); word_list ← sort_ascending(tot_rank); word_list ← postproc(word_list); return word_list]
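The overall flow of Algorithm 1 can be sketched as follows (a simplified Python illustration; `generate_candidates`, `feature_scorers`, and `postproc` are hypothetical placeholders for the components described in the following subsections, not the actual implementation):

```python
def rank_order(scores, descending=True):
    """Rank candidates under one feature score (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=descending)
    return {cand: i + 1 for i, cand in enumerate(ordered)}

def simplify(sentence, complex_word, generate_candidates, feature_scorers, postproc):
    """High-level flow: generate candidates, sum per-feature ranks,
    sort ascending (lower total rank = better), then postprocess."""
    candidates = generate_candidates(sentence, complex_word)
    tot_rank = {c: 0 for c in candidates}
    for scorer in feature_scorers:
        ranks = rank_order({c: scorer(sentence, complex_word, c) for c in candidates})
        for c in candidates:
            tot_rank[c] += ranks[c]
    word_list = sorted(candidates, key=tot_rank.get)
    return postproc(word_list)
```

With dummy scorers, a candidate that ranks first under every feature ends up first in the returned list.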

Simplification Candidate Generation
During candidate generation, for each pair of sentence S and complex word w, the LSBert algorithm first generates a new sequence S′ in which w is masked. The two sentences S and S′ are then concatenated and fed into a pretrained transformer language model (PTLM) to obtain a probability distribution over the vocabulary for the masked position, p(·|S, S′\{w}). The top 10 words from this distribution are considered as the list of simplification candidates.1 Our simplification candidate generation method differs from the one used in LSBert in two ways: (1) the choice of PTLM and (2) the size of the candidate list. Qiang et al. (2021) performed experiments with three BERT models: (i) BERT-base, uncased: 12-layer, 768-hidden, 12-heads, 110M parameters, (ii) BERT-large, uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters, and (iii) BERT-large, uncased, Whole Word Masking (WWM): 24-layer, 1024-hidden, 16-heads, 340M parameters. The results of their experiments indicated that the WWM model obtains the highest accuracy and precision. Here we extended these PTLM experiments to include RoBERTa models (Liu et al., 2019) and also experimented with the combined use of BERT and RoBERTa to enlarge the list of substitution candidates. The results of our experiments indicated that optimal results are obtained using RoBERTa-base: 12-layer, 768-hidden, 12-heads, 125M parameters. To maximize the chance of obtaining at least ten suitable substitution candidates after rigorous filtering based on semantic criteria (see below), we increased the size of the candidate list generated in this step from 10 to 30 candidates.
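Assuming the Hugging Face `transformers` library, the concatenated S‖S′ input and the masked-prediction step might be sketched as follows (`build_masked_input` is our own illustrative helper; the model call is shown in comments because it downloads pretrained weights):

```python
def build_masked_input(sentence, complex_word, mask_token="<mask>"):
    """Concatenate the original sentence S with a copy S' in which the
    complex word is replaced by the mask token (LSBert-style input)."""
    masked = sentence.replace(complex_word, mask_token, 1)
    return f"{sentence} {masked}"

# Masked prediction (requires `pip install transformers torch`;
# downloads the roberta-base weights on first use):
#   from transformers import pipeline
#   fill = pipeline("fill-mask", model="roberta-base")
#   text = build_masked_input("The cat perched on the mat.", "perched")
#   # top_k=30: oversample so ~10 candidates survive later filtering
#   candidates = [r["token_str"].strip() for r in fill(text, top_k=30)]
```

The concatenation lets the model condition on the original word's context twice: once with w present and once with it masked.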

Substitution Ranking
In LSBert, candidate substitutions are ranked based on four features, each of which is designed to capture one aspect of the suitability of the candidate word to replace the complex word. These features are rank orders of candidate substitutions based on four scores: (1) 'Pretrained LM (PTLM) prediction' (B_PTLM(sc); in LSBert, PTLM = BERT), representing the probability derived from the PTLM that the candidate substitution word sc appears at the masked position given the rest of the sentence. (2) 'Language model feature' (L_PTLM(sc)), representing the average loss of the context of sc, w_-m..w_m = (w_-m, w_-m+1, ..., w_0, ..., w_m-1, w_m), where w_0 = sc. (3) 'Semantic similarity' (S(sc)), expressed as the cosine similarity between the fastText vector of the original word and that of sc. (4) 'Word frequency' (F(sc)), as estimated from the top 12 million texts from Wikipedia and the Children's Book Test corpus.2 In LSBert, the rank of a candidate sc, R(sc), is based on an equal weighting of these four features, as shown in equations (1) and (2):

Score(sc) = Rank_B(sc) + Rank_L(sc) + Rank_S(sc) + Rank_F(sc)    (1)

R(sc) = |{sc' ∈ SCS : Score(sc') ≤ Score(sc)}|    (2)
where Rank_f(sc) denotes the rank of candidate sc under feature f and SCS is the set of all substitution candidates. In our three submissions to the shared task, we considered three different strategies to derive the above Score(sc). In the first submission (Mantis_1), we adapted the ranking method as shown in equation (3):

Score(sc) = Σ_f c_f · Rank_f(sc)    (3)

where c_f is the feature weight for feature f. This ranking method introduces a re-weighting of the features so as to (i) increase the relative importance of the semantic similarity between the target word w and a substitute candidate sc and (ii) decrease the relative importance of the probability-based PTLM prediction. With regard to the former, the weight of S(sc), corresponding to ranked cosine similarity, was increased by a factor of 3 to penalize candidates with low similarity to the target word. With regard to the latter, we decided to drop the language model feature L_PTLM(sc), as its correlation with B_PTLM(sc) would otherwise yield an up-weighting of the importance assigned to the probability of sc appearing in the masked position.
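Under this reading of the re-weighting, with the semantic-similarity rank tripled and the language-model feature dropped (weight 0), the Mantis_1 score might be computed as follows (a sketch; the weight of 1 for the remaining two features is our assumption):

```python
# Assumed feature weights for the Mantis_1 ranking: the semantic-similarity
# rank is tripled, the language-model loss feature is dropped (weight 0),
# and the remaining features keep weight 1.
WEIGHTS = {"ptlm": 1, "lm_loss": 0, "similarity": 3, "frequency": 1}

def weighted_score(feature_ranks, weights=WEIGHTS):
    """feature_ranks: {feature_name: {candidate: rank}}.
    Returns the weighted rank sum per candidate; lower = better."""
    candidates = next(iter(feature_ranks.values())).keys()
    return {c: sum(weights[f] * ranks[c] for f, ranks in feature_ranks.items())
            for c in candidates}
```

A candidate with poor semantic-similarity rank is thus penalized three times as heavily as one with a poor frequency rank.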
In the second and third submissions (Mantis_2 and Mantis_3), we experimented with alternative features for substitution ranking. To this end, we first computed lexical complexity scores for each substitution candidate in the sentences of the trial data using 77 indicators (see Table 2 in the appendix). All scores were obtained using an automated text analysis system developed by our group (for its recent applications, see e.g., …). We then used each feature to obtain a rank order of substitution candidates and correlated each ranking with the rank order of substitution candidates provided in the trial data. The top-2 lexical features yielding the largest correlations with the gold-standard ranking were selected for substitution ranking in Mantis_2 and Mantis_3, respectively. Both of these lexical features concern word prevalence (WP), i.e., the number of people who know the word: WP_crowd estimates the proportion of the population that knows a given word, based on a crowdsourcing study involving over 220,000 people (Brysbaert et al., 2019). WP_corp.SDBP is a corpus-derived estimate of the number of books that a word appears in (Johns et al., 2020). The corresponding rankings were obtained as shown in equations (4) and (5):

Score(sc) = Rank_WPcrowd(sc) + Rank_Eq(sc)    (4)

Score(sc) = Rank_WPcorp.SDBP(sc) + Rank_Eq(sc)    (5)

Apart from these WP features, the substitution ranking in runs 2 and 3 was thus determined by a semantic feature, referred to as the 'equivalence score' Eq(sc) (see Section 2.3). This score was motivated by the consideration that semantic similarity measured by cosine similarity of embeddings is not expressive enough (Kim et al., 2016): any two words that are frequently used in similar contexts will have a high cosine similarity between their embeddings. Thus, cosine similarity often fails to recognize antonyms, such as "fast" and "slow". The next section provides more details on how equivalence scores were obtained.
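The feature-selection step, correlating each candidate ranking with the gold ranking on the trial data, could be sketched with SciPy's `spearmanr` (an illustrative reconstruction, not the system's actual code):

```python
from scipy.stats import spearmanr

def select_top_features(feature_ranks, gold_rank, n=2):
    """Pick the n features whose candidate rankings correlate best
    with the gold-standard ranking (Spearman's rho)."""
    cands = sorted(gold_rank)
    corrs = {f: spearmanr([ranks[c] for c in cands],
                          [gold_rank[c] for c in cands]).correlation
             for f, ranks in feature_ranks.items()}
    return sorted(corrs, key=corrs.get, reverse=True)[:n]
```

A feature whose ranking exactly matches the gold order scores rho = 1.0; a reversed ranking scores -1.0.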

Obtaining Equivalence Scores
Lexical simplification needs to preserve the original meaning of the target word. As cosine similarity between embedding vectors can be too permissive, we introduced a stricter criterion based on textual entailment. To this end, we utilized a language model explicitly trained on the natural language inference (NLI) task of evaluating logical connections between sentences. The central idea is to compute, for each substitute word sc, a score that quantifies the textual entailment between the original sentence S and its variant S′ that contains sc. Textual entailment is a directional relation between text fragments that holds whenever the truth of one text fragment follows from the other. The entailing and entailed texts are termed premise (p) and hypothesis (h), respectively. The relation between p and h can be one of entailment, contradiction, or neutral (neither entailment nor contradiction). To the extent that p and h mutually entail each other, they are considered equivalent. In this paper, the entailment scores were obtained from the 'roberta-large-mnli' model from the Hugging Face transformers library.3 Roberta-large-mnli is a RoBERTa-large model (pretrained with a masked language modeling objective) fine-tuned on the Multi-Genre Natural Language Inference corpus (Williams et al., 2018). The entailment score is defined as the probability that p entails h:

ent(p, h) = P(entailment | p, h; θ)

where θ denotes the parameters of the trained roberta-large-mnli model. We quantify the degree of equivalence of two sentences (equivalence score) as the product of the entailment scores in both directions. For a given sentence S and the corresponding simplified sentence S′, the equivalence score is defined as:

Eq(S, S′) = ent(S, S′) · ent(S′, S)

Apart from their use in the substitution ranking of Mantis_2 and Mantis_3, equivalence scores were also used in a postprocessing step in Mantis_1: here, the list of substitution candidates was pruned after ranking by removing candidates whose equivalence scores were smaller than the mean equivalence score of all candidates.
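The two uses of equivalence scores, combining the directional entailment probabilities and pruning low-scoring candidates, can be sketched as follows (the `roberta-large-mnli` call is shown in comments since it downloads large pretrained weights):

```python
def equivalence_score(ent_fwd, ent_back):
    """Eq(S, S') = ent(S, S') * ent(S', S): product of the two
    directional entailment probabilities."""
    return ent_fwd * ent_back

def prune_by_equivalence(eq_scores):
    """Mantis_1 postprocessing: keep only candidates whose equivalence
    score reaches the mean equivalence score of all candidates."""
    mean = sum(eq_scores.values()) / len(eq_scores)
    return [c for c, s in eq_scores.items() if s >= mean]

# Obtaining a directional entailment probability ent(p, h) with
# roberta-large-mnli (requires `pip install torch transformers` and
# downloads the model weights on first use):
#   import torch
#   from transformers import AutoModelForSequenceClassification, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
#   mdl = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
#   def ent(premise, hypothesis):
#       logits = mdl(**tok(premise, hypothesis, return_tensors="pt")).logits
#       return torch.softmax(logits, dim=-1)[0, 2].item()  # index 2 = entailment
```

Taking the product of both directions means a candidate is only kept when each sentence strongly entails the other, i.e. when they are near-equivalent.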
3 https://huggingface.co/roberta-large-mnli

End-to-end System Performance

The official results across seven performance metrics4 are presented in Table 1 in the appendix (for details, see Saggion et al. (2022)). As the performance metrics are strongly intercorrelated (mean correlation across all metrics = 0.920, sd = 0.071; see also Figure 2 in the appendix), we focus our discussion here on the results of one metric from each of the three groups: (1) Accuracy@1, (2) MAP@10, and (3) Potential@10 (see Figure 1). Our best-performing system was 'Mantis_1'. This system reached 2nd rank on both MAP@10 and Potential@10 and 3rd rank on Accuracy@1. Mantis_1 displayed an improvement over the median performance of +25.56% on Accuracy@1, +24.13% on Potential@10, and +9.93% on MAP@10. It outperformed the LSBert baseline by +5.9% on Accuracy@1, +4.38% on MAP@10, and +3.49% on Potential@10. The two systems whose substitution ranking was based solely on word prevalence and an equivalence score lagged behind the LSBert baseline on two of the performance metrics shown here, suggesting that the improvements of our system over LSBert were mainly due to better substitution ranking rather than candidate selection. However, Mantis_2 outperformed LSBert on the Potential@10 metric, suggesting that word prevalence can be fruitfully employed to improve LS systems. In future work, we intend to explore the role of these and additional indicators of lexical sophistication in substitution ranking.

Figure 1: Performance ranking based on Accuracy, Mean Average Precision, and Potential scores (k=10). Vertical lines represent the median performance across the 33 submissions for each metric.