Unsupervised Lexical Simplification with Context Augmentation

We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model substantially outperforms other unsupervised systems across all languages. We also establish a new state-of-the-art by ensembling our model with GPT-3.5. Lastly, we evaluate our model on the SWORDS lexical substitution data set, achieving a state-of-the-art result.


Introduction
Lexical simplification is the task of replacing a word in context with an easier term without changing its core meaning, to make text easier to read for non-technical audiences, non-native speakers, or people with cognitive disabilities (e.g. dyslexia).
One common approach (Li et al., 2022; Qiang et al., 2020a,b) is to use a masked language model (MLM) such as BERT (Devlin et al., 2019) and predict substitutes via word prediction over the masked target word. However, one limitation is that this critically relies on the target context being discriminative of the semantics of the target word, which is not always the case. Given this, we propose a new unsupervised method that performs context augmentation. Specifically, we sample sentences that contain the target word from monolingual data, and identify substitutes that can replace the word in both the target and sampled sentences. Based on our experiments in English, Portuguese, and Spanish over the TSAR-2022 shared task (Saggion et al., 2022), we show that our model comfortably outperforms other unsupervised models. We also establish a new state-of-the-art by ensembling our model with InstructGPT (Ouyang et al., 2022), and further demonstrate the effectiveness of the method on the related task of lexical substitution.

Method
We propose a fully unsupervised model using pre-trained language models (without fine-tuning) and monolingual data. Given the target word x and context c_x, our model generates substitutes for x based on not only c_x but also augmented contexts sampled from monolingual data.

Generation Based on the Target Context
To generate substitutes of x given c_x, we extend the lexical substitution approach of Wada et al. (2022), which generates an aptness score S(y|x, c_x) for each word y ∈ V as follows:

S(y|x, c_x) = max_k cos(f_k(y), f(x, c_x)),   (1)

where cos denotes cosine similarity; f(x, c_x) denotes the contextualised embedding of x in c_x; and f_k(y) denotes the decontextualised embedding of y, represented by K clustered embeddings f_1(y), ..., f_K(y), which are obtained by first sampling 300 sentences that contain y from monolingual corpora, and then clustering the contextualised embeddings of y using K-means (K = 4). For each cluster k, f_k(y) is calculated as the average of the contextualised embeddings of y over the sentences in C_{y,k}, where C_{y,k} denotes the sentences that contain y and belong to cluster k.
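As an illustration, the clustering and max-similarity scoring of Eqn. (1) can be sketched as follows. This is a minimal sketch only: small numpy arrays stand in for real contextualised embeddings, and K-means uses a deterministic first-k initialisation (rather than the usual random seeding) so the example is reproducible.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kmeans(vecs, k, iters=20):
    """Minimal K-means; deterministically seeds centroids with the first k points."""
    centroids = vecs[:k].copy()
    for _ in range(iters):
        # assign each vector to its nearest centroid, then recompute the means
        dists = np.linalg.norm(vecs[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = vecs[labels == j].mean(axis=0)
    return centroids

def aptness(target_vec, cand_context_vecs, k=2):
    """S(y|x, c_x) = max_k cos(f_k(y), f(x, c_x)) over the cluster centroids f_k(y).

    The paper uses K = 4; k = 2 here only to suit the toy data.
    """
    return max(cos(c, target_vec) for c in kmeans(cand_context_vecs, k))
```

With toy contexts of a candidate word falling into two sense clusters, the score is high when the target embedding matches either cluster centroid and low otherwise.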
While this is the state-of-the-art unsupervised method on the SWORDS lexical substitution data set (Lee et al., 2021), one major limitation is that S(y|x, c_x) in Eqn. (1) heavily depends on f(x, c_x): if the meaning of x is not well captured by f(x, c_x), the model may retrieve erroneous substitutes. In fact, this is often the case in lexical simplification, where x is usually a rare word and gets segmented into subword tokens (in which case f(x, c_x) is represented by the average of the subword embeddings). For instance, given the target word bole, the model retrieves toe as one of the top-10 substitutes, likely because the segmented bol ##e and to ##e share the same subword ##e, suggesting that words that share the same token(s) tend to have similar representations regardless of their semantic similarity. To mitigate this, when x is tokenised into multiple tokens, we add the term α cos(E(x), E(y)) to Eqn. (1), where α is a scalar value and E(x) and E(y) are pre-trained static embeddings of x and y; we use fastText (Bojanowski et al., 2017) for this purpose. Since static embeddings are pre-trained with a large vocabulary (e.g. 200k words), they tend to represent the semantics of rare words better than averaging the embeddings of their (suboptimally tokenised) subwords. We tune α on the dev set and set it to 0.2, 0.7 and 0.6 for English, Spanish, and Portuguese, respectively. For the embedding model f, we use DeBERTa-V3 (He et al., 2023) for English, and monolingual BERT models for Spanish and Portuguese (Cañete et al., 2020; Souza et al., 2020). We extract the M_1 = 15 words with the highest scores.
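The static-embedding backoff term can be sketched as follows. The tokeniser and embedding table here are toy stand-ins for a real subword tokeniser and fastText vectors; the only behaviour the sketch commits to is adding α cos(E(x), E(y)) when the target word splits into multiple tokens.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def backoff_score(context_score, x, y, static_emb, tokenize, alpha=0.2):
    """Add alpha * cos(E(x), E(y)) only when x is split into multiple subword tokens."""
    if len(tokenize(x)) > 1:
        context_score += alpha * cos(static_emb[x], static_emb[y])
    return context_score
```

A single-token target word is scored by the contextual model alone, so the backoff only changes rankings for rare, over-segmented words.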

Generation with Context Augmentation
Following previous work on lexical simplification (Li et al., 2022; Qiang et al., 2020a,b), we also generate substitutes based on MLM prediction, by replacing x with a mask token and performing word prediction. In this approach, the predictions are not affected by the embedding quality or tokenisation of x. However, if we rely solely on the target context c_x as in previous work, the model has difficulty predicting substitutes when the context is not very specific, e.g. The bole was cut into pieces. (While previous work concatenates the masked sentence with the original (unmasked) sentence, this does not completely solve the problem.)

(A note on the static embeddings in Section 2.1: while fastText similarly makes use of character n-grams to represent words, it also trains a unique representation for each word, which is not shared with any other words; e.g. the word embedding of her is constructed from its character n-grams plus the special sequence <her>, where < and > correspond to the beginning and end of the token. We also tried using GloVe (Pennington et al., 2014) instead of fastText in English, and observed comparable results.)

To address this problem, we perform context augmentation using monolingual data. Following the process for generating decontextualised embeddings in Wada et al. (2022), we sample 300 sentences that contain x from monolingual corpora and cluster them using K-means (K = 4). For each sentence in cluster k, we replace x with a mask token and feed it into T5 (Raffel et al., 2020) in English, or mT5 (Xue et al., 2021) in Spanish and Portuguese, to generate 20 substitutes using beam search, retaining those that contain only one word (which can comprise multiple subwords). Then, within each cluster k, we aggregate the substitutes across all sentences c'_x ∈ C_{x,k} and extract the M_2 = 25 most-generated words. For each substitute candidate y, we calculate the score S(y|x, c_x) as:

S(y|x, c_x) = Σ_k w_k Σ_{c'_x ∈ C_{x,k}} I(y|c'_x),   (2)

where I(y|c'_x) denotes a function that returns 1 if y is generated by T5 given the context c'_x, and 0 otherwise; and w_k denotes the number of substitutes in cluster k that overlap with the M_1 words generated from the target context c_x in Section 2.1. Here, w_k roughly corresponds to the semantic relevance of cluster k to c_x; e.g. if w_k = 0, the candidates in cluster k likely reflect a different sense of x from the one in the target context and hence are not considered. Intuitively, this scoring function favours substitutes that appear frequently in the sampled contexts, weighted by how semantically relevant the substitute's cluster is to the original context; we will show its effectiveness with an example in Section 4. Finally, we retrieve the M_2 words with the highest scores and combine them with the M_1 candidates generated from c_x.
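The context-augmented score can be sketched as follows. This is a simplification: the cluster weight w_k is computed here against all of a cluster's generated substitutes rather than its top-M_2 words, and the T5 generations are replaced by toy word lists (one per masked sentence).

```python
from collections import Counter

def augmented_scores(clusters, target_candidates):
    """Score candidates over augmented contexts.

    clusters: one list per cluster; each entry is the list of substitutes
        generated for one masked sentence in that cluster.
    target_candidates: the M1 words generated from the target context.
    Computes S(y) = sum_k w_k * count_k(y), with w_k the overlap between the
    cluster's substitutes and the target-context candidates.
    """
    scores = Counter()
    target = set(target_candidates)
    for sentences in clusters:
        counts = Counter(w for subs in sentences for w in subs)
        w_k = len(set(counts) & target)  # cluster weight: overlap with target candidates
        for y, n in counts.items():
            scores[y] += w_k * n
    return scores
```

A cluster with no overlap with the target-context candidates gets w_k = 0, so substitutes reflecting a different sense of the target word contribute nothing.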

Reranking
Given the M_1 + M_2 candidates (potentially with overlap), we rerank them using four different metrics: (i) embedding similarity; (ii) LM perplexity; (iii) word frequency; and (iv) S(y|x, c_x) in Eqn. (2). For the first metric, we use the reranking method proposed by Wada et al. (2022): for each candidate y, we replace x in c_x with y and calculate the cosine similarity between the contextualised embeddings f(x, c_x) and f(y, c_x). For the LM perplexity metric, we replace x in c_x with a mask token and calculate the probability of generating y using T5; this score helps measure the syntactic fit of y in c_x. The third metric corresponds to the frequency of y in monolingual data, which serves as a proxy for lexical simplicity. Finally, the last metric measures how often y can substitute for x in the augmented contexts. Using each metric, we obtain four independent rankings R_1, R_2, R_3, R_4 and calculate their weighted sum r_1 R_1 + r_2 R_2 + r_3 R_3 + r_4 R_4, which is then sorted in ascending order to produce the final ranking. We tune the weights {r_1, r_2, r_3, r_4} on the dev set for each language: {5, 1, 1, 1}, {3, 1, 0, 3}, and {3, 1, 0, 2} for English, Spanish and Portuguese, respectively.
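The weighted rank combination can be sketched as follows, assuming every ranking orders the same candidate set (best first); a lower weighted rank sum means a better final position.

```python
def combine_rankings(rankings, weights):
    """Weighted Borda-style combination: score(y) = sum_i r_i * rank_i(y); lower is better."""
    total = {}
    for ranking, r in zip(rankings, weights):
        for rank, cand in enumerate(ranking, start=1):
            total[cand] = total.get(cand, 0.0) + r * rank
    return sorted(total, key=total.get)
```

Setting a weight to 0 (as done for word frequency in Spanish and Portuguese) simply removes that metric's ranking from the sum.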

Data and Evaluation
We experiment on the TSAR-2022 shared task on multilingual lexical simplification (Saggion et al., 2022; Štajner et al., 2022; Ferrés and Saggion, 2022; North et al., 2022b). We use its trial data as our dev set (about 10 instances per language) and evaluate models on the test set, which contains about 370 instances per language. Evaluation is according to four metrics: Accuracy@1 = the % of instances for which the top-1 substitute matches one of the gold candidates; Accuracy@k@top1 = the % of instances where one of the top-k substitutes matches the top-1 gold label; Potential@k = the % of instances where at least one of the top-k substitutes is included in the gold candidates; and MAP@k = the mean average precision of the top-k candidates.
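Per-instance versions of these metrics can be sketched as follows; note this follows one common reading of the definitions (in particular, normalising average precision by k), which may differ in detail from the official shared-task scorer.

```python
def accuracy_at_1(preds, gold):
    """Top-1 prediction matches any gold candidate."""
    return preds[0] in gold

def potential_at_k(preds, gold, k):
    """At least one of the top-k predictions is a gold candidate."""
    return any(p in gold for p in preds[:k])

def map_at_k(preds, gold, k):
    """Average precision over the top-k predictions, normalised by k."""
    hits, total = 0, 0.0
    for i, p in enumerate(preds[:k], start=1):
        if p in gold:
            hits += 1
            total += hits / i
    return total / k
```

Corpus-level scores are then the mean of each per-instance value over the test set.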

Baselines
We compare our method against several systems submitted to the shared task. In all languages, UniHD (Aumiller and Gertz, 2022) is by far the best system across all metrics. It prompts GPT-3.5 (text-davinci-002, a.k.a. InstructGPT: Brown et al. (2020); Ouyang et al. (2022)) to provide ten easier alternatives for the target word x given c_x, in two variants: zero-shot and ensemble. The former generates substitutes based on the target word and context only, whereas the latter ensembles the predictions from six different prompts and temperatures; among them, four prompts include one or two question-answer pairs retrieved from the dev set to allow InstructGPT to perform in-context learning (as detailed in Table 8 in the Appendix). While the ensemble model achieves the best results across all languages, it is not strictly comparable with the other systems, as InstructGPT is supervised on various tasks with human feedback. As such, we also include the second-best systems (which differ for each language) as baselines, namely: MANTIS (Li et al., 2022), GMU-WLV (North et al., 2022a), and PresiUniv (Whistely et al., 2022). We also include the shared task baseline LSBert (Qiang et al., 2020a,b). All of these systems are based on pre-trained MLMs like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and three of them also employ static word embeddings similarly to our model. Lastly, we include Wada et al. (2022) with and without fastText in our baselines.

Results
Table 1 presents the results in English, Portuguese, and Spanish. The first five rows are based on InstructGPT: the first two show the zero-shot/ensemble performance reported in Aumiller and Gertz (2022), and the next two show the results when we replace text-davinci-002 with gpt-3.5-turbo. The results show that gpt-3.5-turbo substantially outperforms text-davinci-002. The last row for InstructGPT shows the result when we prompt the model to provide simplified alternatives for x without the target context (shown as "w/o context"), which indicates that the model performs very well even without access to the target context. This result demonstrates that the model has memorised lists of synonyms, and that most instances are not very context-dependent; we return to this in Appendix B.
The next five rows (marked "Unsupervised") show the performance of the unsupervised models, including ours. Our model clearly outperforms the other systems across all languages. In English, it even outperforms the zero-shot GPT-3.5-turbo in Potential@3 (94.1 vs. 92.8), despite the substantial differences between these models in terms of model size (435M and 800M parameters for DeBERTa-V3 and T5, respectively, vs. 175B parameters for GPT-3.5) and language resources (our model employs monolingual data only, while GPT-3.5 is instructed with human feedback). The strong performance in English is largely owing to the use of better LMs (DeBERTa-V3 and T5) compared to the ones used for Spanish and Portuguese (BERT and mT5), as evidenced by the substantial performance drop when we use BERT and mT5 for English. The comparison of Wada et al. (2022) with and without fastText demonstrates the effectiveness of including static embeddings, especially in Portuguese and Spanish. This is because the vocabulary size of Portuguese/Spanish BERT is much smaller than that of DeBERTa-V3 (30/31k vs. 128k), and a large number of target words are segmented into subwords and embedded poorly. Lastly, we try ensembling: (1) the six rankings from GPT-3.5-turbo-ens; (2) the word frequency ranking (which we find boosts performance); and (3) the final ranking of OURS.
The last two rows show the performance for (1) + (2) vs. (1) + (2) + (3). The ensemble of eight rankings including our method establishes a new state-of-the-art across all languages in most metrics, suggesting that our model is somewhat complementary to InstructGPT. In the Appendix, we provide more detailed results (Table 7) and error analysis (Appendix B). (The exact scores when using BERT and mT5 for English are 69.4, 58.7, 45.0, and 88.2 for ACC@1, ACC@3@Top1, MAP@3, and Potential@3, respectively.)


Experiment on Lexical Substitution
We also evaluate our model on the English lexical substitution task over the SWORDS data set (Lee et al., 2021). For lexical substitution, there is no restriction on lexical simplicity, so we drop the word frequency feature in reranking (i.e. set r_3 to 0). Table 2 shows the results in the lenient and strict settings. F_a and F_c denote the F1 scores given two different sets of gold labels a and c, where a ⊂ c. In the strict setting, our model outperforms the best unsupervised model of Wada et al. (2022) and also the (non-LLM) state-of-the-art semi-supervised model of Qiang et al. (2023), which employs BLEURT (Sellam et al., 2020) and a sentence-paraphrasing model, both of which are pre-trained on labelled data. Lastly, we also ensemble our model with the six outputs of GPT-3.5, and establish a new state-of-the-art.

Ablation Study
We perform ablation studies on the effect of clustering in Eqn. (2), and present the results in Table 3; the scores are averaged over English, Portuguese, and Spanish. "Soft Retrieval" indicates the performance when we take the weighted sum of the clusters as proposed in Eqn. (2); "Hard Retrieval" denotes when we set w_k = 1 for the closest cluster and w_k = 0 otherwise; and "No Clustering" denotes when we set w_k = 1 for all clusters, which is equivalent to performing no clustering.
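The three ablation variants amount to different choices of the cluster weights w_k, which can be summarised as follows; in this sketch, the highest-overlap cluster stands in for the "closest" cluster of Hard Retrieval.

```python
def cluster_weights(overlaps, mode):
    """Return the w_k used in Eqn. (2) under each ablation variant.

    overlaps: raw overlap counts between each cluster's substitutes and the
    target-context candidates (used here as a proxy for cluster relevance).
    """
    if mode == "soft":  # proposed: weight clusters by their overlap counts
        return list(overlaps)
    if mode == "hard":  # keep only the most relevant cluster
        best = max(range(len(overlaps)), key=overlaps.__getitem__)
        return [1 if i == best else 0 for i in range(len(overlaps))]
    if mode == "none":  # no clustering: every cluster counts equally
        return [1] * len(overlaps)
    raise ValueError(f"unknown mode: {mode}")
```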
The table shows that our proposed method performs the best overall, albeit with a small margin over "No Clustering". This is more or less expected, since the majority of target words are used in their predominant senses (as evidenced by the strong performance of GPT-3.5 w/o context in Table 1), in which case retrieving the most-generated words across all sampled sentences suffices to produce good substitutes. Table 4 shows one example where clustering plays a crucial role (more predictions plus another example are shown in Table 5 in Appendix A). In this example, the target word elite is used as a noun meaning "a select group", but all clusters except Cluster 2 produce substitutes for elite in its adjectival senses. Therefore, if we naïvely aggregate the words across all clusters ("No Clustering"), we end up retrieving adjectives such as top and professional, whereas our weighted-sum approach ("Soft Retrieval") successfully extracts good substitutes from the relevant clusters.

Related Work
Recent lexical simplification models generate substitute candidates using MLM prediction and rerank them using features such as fastText embedding similarities and word frequency (Qiang et al., 2020a; Li et al., 2022). Some also use external tools or resources such as POS taggers or paraphrase databases (Qiang et al., 2020b; Whistely et al., 2022). However, Aumiller and Gertz (2022) show that GPT-3.5 substantially outperforms previous models on the TSAR-2022 shared task. Similar to this work, our prior work (Wada et al., 2023) samples sentences from monolingual corpora and uses them to paraphrase multiword expressions with literal expressions (composed of one or two words).

Conclusion
We propose a new unsupervised lexical simplification method with context augmentation. We show that our model outperforms previous unsupervised methods, and that by combining our model with InstructGPT, we achieve a new state-of-the-art for lexical simplification and substitution.

Limitations
One limitation of our model is that it performs context augmentation using monolingual data, which incurs additional time and computational cost. However, if we construct a comprehensive list of complex words X and sample sentences containing each x ∈ X in advance, we can pre-compute the generation counts Σ_{c'_x ∈ C_{x,k}} I(y|c'_x) in Eqn. (2) without considering the target context c_x (which is required only to calculate w_k). Therefore, we can still generate substitutes in an online manner during inference, as long as the target word x is included in X.
Compared to the InstructGPT baseline, our model critically relies on word embeddings and MLM prediction, both of which hinge on word co-occurrence statistics. This sometimes results in wrongly predicting antonyms of the target word as substitutes due to the similarity of their surrounding contexts (e.g. famed for infamous; more specific examples and error types are shown in Appendix B). On the other hand, InstructGPT benefits from supervision with human feedback and also makes use of lexical knowledge provided in various forms of text during pre-training, including dictionaries, thesauri, and web discussions about the meanings of words. This is clearly one of the reasons why InstructGPT substantially outperforms the other unsupervised systems, including ours; in fact, we find that it performs extremely well even without access to the target context (Table 1), motivating a call for including more context-sensitive instances in lexical substitution/simplification data sets; more discussion follows in Appendix B.

A Impact of Clustering
Table 5 shows two examples where clustering plays a crucial role; the first instance was partially shown in Table 4 and discussed in Section 4. In the second instance, Soft Retrieval retrieves substitutes that are relevant to the meaning of the target word extend in this particular context. Without clustering, on the other hand, we get a mixed bag of words that represent different senses of extend (with more frequent senses ranked higher). In both cases, GPT-3.5 produces context-aware substitutes, although this is not always the case (as we discuss in the next section), and some of its candidates do not fit naturally in context (e.g. ruling class is predicted as the best substitute for elite used in ruling elite).

B Error Analysis
Table 6 shows some examples of outputs from our model and GPT-3.5-turbo-ens on the lexical simplification and substitution tasks. In the first example, our model generates near-synonyms of the target word infamous with both negative and positive connotations (e.g. notorious and famed, respectively), while GPT-3.5 generates only negative-connotation words like disgraceful. This is because our model heavily relies on word representations and assigns high scores to words that appear in similar contexts. Similarly, our model incorrectly predicts notoriously as a substitute for infamous despite its ungrammaticality in this context, likely due to the similarity of their embeddings; we surmise that using a part-of-speech tagger would help alleviate this problem. In the second instance, our model is overly affected by the target context and generates words that often appear in contexts similar to the target context but have different semantics from the target word strategic, e.g. global and economic. In comparison, GPT-3.5 generates more correct substitutes; however, some of them do not sound quite natural in this context (e.g. calculated). This is more evident in the third instance, where some of the "gold" substitutes are semantically marked when put into this context: the original phrase in defiance of calls means "opposition against calls", but in resistance of calls and in rebellion of calls (both predicted by GPT-3.5 and included in the gold labels) do not sound natural. These examples suggest that human annotators are sometimes oblivious to the context and consider substitutes largely based on the out-of-context similarities of the words, which motivates a call for revisiting the exact goal of lexical simplification/substitution and its annotation schemes, e.g. whether the words should be annotated based on the similarity of lexical semantics or on acceptability in context. The same concern is also raised by the strong performance of GPT-3.5 without access to the target context (Table 1).
In the last three examples, which are taken from the SWORDS lexical substitution data set, the sensitivity of our model to context works favourably and results in better substitutes than GPT-3.5, which, in those examples, generates substitutes without considering the context very much (in contrast to the examples in Table 6). For these instances, we also tried using the ChatGPT web interface (the free version, accessed in May 2023) and found that its outputs are highly stochastic even with the same prompt: sometimes it returns substitutes quite similar to the ones generated by GPT-3.5-turbo, and other times it generates more context-aware and accurate substitutes (e.g. business for service and probably/presumably for likely). As such, further investigation is needed to see how carefully the model pays attention to the context (given different prompts), and how well it works for instances that require a profound understanding of the context.

Table 1 :
The results on lexical simplification. "-zero/ens" denotes the zero-shot/ensemble models, and "w/o context" indicates the performance without access to the target context. The best scores among the InstructGPT and unsupervised models are underlined, and the overall best scores are boldfaced.

Table 2 :
The results on lexical substitution.

Table 3 :
The results on the lexical simplification task using different cluster-retrieval methods in Eqn. (2). The scores are averaged over English, Portuguese, and Spanish. "Soft Retrieval" indicates our original method proposed in Eqn. (2), and "Hard Retrieval" denotes when we set w_k = 1 for the closest cluster and w_k = 0 otherwise. The last row indicates when we set w_k = 1 for all clusters, which is equivalent to performing no clustering.

Table 4 :
The substitutes generated based on the target context (Wada et al. (2022); Section 2.1) and the augmented contexts (Section 2.2). "OURS" denotes the substitutes obtained by reranking the candidates of Wada et al. (2022) and "Soft Retrieval". The values of w_k denote the weights for each cluster in Eqn. (2), which correspond to the number of shared words between the top-15 words from Wada et al. (2022) and the top-25 words from each cluster. The words included in the gold labels are boldfaced.