GeneSis: A Generative Approach to Substitutes in Context

The lexical substitution task aims at generating a list of suitable replacements for a target word in context, ideally keeping the meaning of the modified text unchanged. While its usage has increased in recent years, the paucity of annotated data prevents the finetuning of neural models on the task, hindering the full fruition of recently introduced powerful architectures such as language models. Furthermore, lexical substitution is usually evaluated in a framework that is strictly bound to a limited vocabulary, making it impossible to credit appropriate, but out-of-vocabulary, substitutes. To address these issues, we propose GeneSis (Generating Substitutes in contexts), the first generative approach to lexical substitution. Thanks to a seq2seq model, we generate substitutes for a word according to the context it appears in, attaining state-of-the-art results on different benchmarks. Moreover, our approach allows silver data to be produced for further improving the performance of lexical substitution systems. Along with an extensive analysis of GeneSis results, we also present a human evaluation of the generated substitutes in order to assess their quality. We release the fine-tuned models, the generated datasets, and the code to reproduce the experiments at https://github.com/SapienzaNLP/genesis.


Introduction
The lexical substitution task (McCarthy and Navigli, 2009) requires a system to provide adequate replacements for a target word in a given context. Through the years, two lexical substitution variants have been proposed, i.e., candidates ranking and substitutes prediction (Melamud et al., 2015). While the former aims at ranking a list of predefined candidate substitutes for a word in a given context, the latter is more challenging, requiring a system to output a sorted list of replacements without any predefined substitutes inventory. Although it is not explicitly required by either of the two tasks, a good substitution system is expected to capture the semantics of its input and implicitly perform a soft disambiguation. For example, denoting bright as target word in the context sentence "She is a bright student", we expect a good substitution system to provide a set of substitutes closer to {intelligent, clever, smart} than to {luminous, clear, light}. Thanks to this implicit disambiguation capability, lexical substitution has shown its usefulness in several scenarios, such as word sense induction (Başkaya et al., 2013;Amrami and Goldberg, 2018;Arefyev et al., 2019), data augmentation (Jia et al., 2019;Arefyev et al., 2020), word sense disambiguation (Hou et al., 2020) and semantic role labeling (Bingel et al., 2018).
However, despite having been employed in numerous downstream tasks, the lexical substitution task still presents unresolved issues that need to be addressed. First, the shortage of large-scale corpora annotated with the expected substitutes hinders the use of supervised techniques, including powerful Transformer-based language models, thus leaving the task in a possibly sub-optimal setting. Second, the evaluation metrics provided for the task are bound to the test vocabulary, hence they fail to capture the quality of substitutes outside the vocabulary; moreover, the vocabulary is usually small and often biased by the particular linguistic style and background of the annotators who developed the datasets.
In this paper, we focus on substitutes prediction and address the above problems by proposing GENESIS, a generative approach to lexical substitution. We find that not only is this novel approach effective when tested on the lexical substitution task, but also that it can be applied to generate substitutes from raw text, enabling the effortless construction of large-scale silver data. Moreover, we conduct an annotation task to analyze the results of our model and to validate out-of-vocabulary generations.
Our contribution is threefold:
• A novel generative approach to lexical substitution that outperforms the state of the art.
• An automated method to produce high-quality silver data for lexical substitution.
• An annotation task to evaluate out-of-vocabulary generations.

Related Work
Through the years, several approaches have been developed to tackle the lexical substitution tasks, but, to the best of our knowledge, ours is the first attempt to apply a generative approach to it. In what follows, we first review the principal approaches and resources for the lexical substitution tasks, and then provide a brief overview of generative methods across different fields.
Lexical Substitution Approaches Since its presentation by McCarthy and Navigli (2007), a variety of different approaches have been explored to produce the substitutes that best fit the context. Earlier methods made use of external knowledge bases such as WordNet (Miller, 1995) to extract possible substitutes and construct delexicalized features (Szarvas et al., 2013), or they employed word embeddings to represent both the target and the substitutes in their context and rank them through ad-hoc metrics (Melamud et al., 2015, 2016). However, the recent spread of pre-trained language models has deeply reshaped approaches to lexical substitution, standardizing the use of contextualized word representations to provide a context-aware distribution over the output vocabulary. The first work in this direction was that of Garí Soler et al. (2019), where ELMo (Peters et al., 2018) embeddings are used to rank substitutes according to their cosine similarity to the target. In Zhou et al. (2019), instead, the input context is represented through a BERT model (Devlin et al., 2019). The authors partially mask the target word in its context, in order to obtain a representation that includes a faded target information; this representation is then used to obtain a probability distribution over the BERT output vocabulary that is not biased towards the target. Finally, the top scoring substitutes are reranked with a measure of similarity that takes into account both the cosine similarity and the relative attention scores between the target and the substitute. In a similar vein, Arefyev et al. (2020) proposed an extensive comparison of how several pretrained language models perform on the task, also injecting information about the target from word embeddings or rephrasing the input with dynamic patterns.
Their best performing method produces an XLNet (Yang et al., 2019) contextualized embedding of the target word combined with static frequency information about proximity between target and substitute. This combined representation is then used to obtain a ranking of substitutes from the XLNet vocabulary that is further refined with postprocessing.
Despite the improvement in performances that large language models brought to the task, these methods work in a potentially sub-optimal setting, since they are used as feature extractors and are not finetuned, due to the paucity of large-scale annotated data (Garí Soler et al., 2019).
Lexical Substitution Resources The first dataset released was the Lexical Substitution Task (LST), proposed as test set for the task by McCarthy and Navigli (2007). It contains 2010 sentences with a single target word per sentence, including around 200 distinct targets. Each instance is associated with several substitutes that were chosen by five native English speaker annotators. The small coverage of LST led to the creation of the Turk bootstrap Word Sense Inventory (Biemann, 2012, TWSI), a first attempt to collect a large-scale dataset. The author deployed Amazon Mechanical Turk to annotate 25K sentences from Wikipedia, which, however, only cover noun targets. To overcome this shortcoming, Kremer et al. (2014) proposed Concept In Context (CoInCo), a dataset of 2474 sentences covering 3874 distinct targets with diverse part-of-speech tags. Each sentence has one or more targets, for a total of 15k instances annotated through Amazon Mechanical Turk.
Generative Approaches Generative pre-trained language models such as GPT (Radford et al., 2018) have proven highly effective in Natural Language Generation, catching the attention of the research community. Indeed, pre-trained models such as BART (Lewis et al., 2020) suit a wide range of NLP applications. Thanks to the flexibility of seq2seq learning, these models can be easily adapted to different tasks, including sequence and token classification or sequence generation, inter alia. Interestingly, generative models have also been employed in tasks that are not usually formulated as sequence-to-sequence learning; for example, there have been effective applications of seq2seq architectures to definition modeling (Bevilacqua et al., 2020), cross-lingual Abstract Meaning Representation (Blloshmi et al., 2020), end-to-end Semantic Role Labeling (Blloshmi et al., 2021) and Semantic Parsing (Procopio et al., 2021; Bevilacqua et al., 2021a).
Inspired by these successful applications of generative approaches, we propose applying, for the first time, a generative seq2seq model to the lexical substitution task. Differently from previous approaches in the field, we finetune a pre-trained model to produce substitutes starting from a word in its context. Moreover, our method can be used to generate silver data for the lexical substitution task, mitigating the lack of annotated data.

GENESIS
The task of substitutes prediction requires finding replacements for a target word in a context that ideally do not modify the overall meaning of that context. More formally, given a target word $w_t$ occurring in a context sentence $x = w_1, \dots, w_n$, a substitution system has to assemble a ranked list $s$ of possible replacements for $w_t$ according to its context $x$. Consider as an example the context
The roses are bright. (1)
where the target $w_t$ = bright appears. As output of our system, we expect a generated list of substitutes, such as $s$ = [vivid, luminous, shining]. We tackle the lexical substitution task with a two-stage process: first, we use a seq2seq model that takes as input both the context and the target, and generates several possible lists of substitutes (substitutes generation, Section 3.1); second, we process the substitutes collected with the first step to obtain the final, ranked list (substitutes ranking, Section 3.2). The whole process is described in Figure 1. Throughout the paper, we will consider each target word $w_t$ to be univocally associated with its part-of-speech (POS) tag. To improve readability we discard POS tags from the notation.

Substitutes Generation
We assume a seq2seq model $M$ that, given a context $x$ where a target word $w_t$ occurs, is able to generate a sequence of substitutes $s$ by modeling the probability
$$P(s \mid x, w_t) = \prod_{i=1}^{|s|} P(s_i \mid s_0, \dots, s_{i-1}, x, w_t)$$
where $s_0$ is a special start token. In order to structure both the target and the context as a single input sequence for $M$, we identify the target in its context by surrounding it with two special tokens. Formally, for a target word $w_t$ in $x$ we define the input $m_{w_t,x}$ as:
$$m_{w_t,x} = w_1, \dots, \texttt{<t>}\, w_t\, \texttt{</t>}, \dots, w_n$$
Thus, example (1) is structured as:
The roses are <t>bright</t>.
The expected output $s$, instead, is a comma-separated sequence $s = s_1, \dots, s_q$ where each word $s_i$ is a possible substitute for $w_t$ in $x$. At training time, we provide the model with a sequence of gold substitutes $\hat{s} = \hat{s}_1, \dots, \hat{s}_k$, also structured as a comma-separated list. Thus, we can train $M$ by minimizing the cross-entropy loss between the gold and the generated sequences. At inference time, for each input sequence $m_{w_t,x}$ we actually produce several substitute sequences $s^1, \dots, s^b$ obtained with beam-search decoding (Figure 1(a)).
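To make the input and output formats concrete, the following minimal Python sketch wraps a target with the two special tokens and splits a generated comma-separated sequence back into individual substitutes. The function names (`mark_target`, `parse_substitutes`) and the use of character offsets to locate the target are illustrative assumptions and are not taken from the released code; only the `<t>`/`</t>` markers and the comma-separated output format come from the description above.

```python
def mark_target(sentence, start, end, open_tag="<t>", close_tag="</t>"):
    """Surround the target span sentence[start:end] with the two special tokens."""
    if not 0 <= start < end <= len(sentence):
        raise ValueError("invalid target span")
    return sentence[:start] + open_tag + sentence[start:end] + close_tag + sentence[end:]


def parse_substitutes(generated):
    """Split a comma-separated output sequence into a list of clean substitutes."""
    return [piece.strip() for piece in generated.split(",") if piece.strip()]


# Usage on example (1): locate the target by its character offsets.
sentence = "The roses are bright."
start = sentence.index("bright")
model_input = mark_target(sentence, start, start + len("bright"))
substitutes = parse_substitutes("vivid, luminous, shining")
```

Splitting on commas rather than on whitespace keeps multi-word substitutes (e.g. "prior to that") intact.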

Substitutes Ranking
Once the model has produced a set of substitute sequences, we collect the unique substitutes and rank them according to the context.

Collection and Filtering
First, we create the set $W$ of words that occur across the sequences $s^1, \dots, s^b$. $W$ could contain inappropriate substitutes, such as the target itself or words that are closely related to the target but have a different part of speech (Figure 1(b)). To provide a cleaned list of substitutes, for each target word we define a possible output vocabulary and remove from $W$ all the words that are not part of it, including the target itself (Figure 1(b), bold). We denote this reduced set as $W_{clean}$. The building of the output vocabulary is detailed in Section 4.
Figure 1: A schematic representation of GENESIS. The input is fed to a seq2seq model that produces several sequences of substitutes (a); the substitutes are collected and filtered according to an output vocabulary (b). We create a new sentence for each substitute by using it as replacement for the target in the original context (c); finally, we use the contextualized representations of the substitutes to rank them by similarity with the target (d).
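The collection and filtering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collect_and_filter`, the representation of beams as comma-separated strings and the toy vocabulary are hypothetical, and the sketch preserves first-occurrence order purely for readability (the actual ordering is decided later by the ranking step).

```python
def collect_and_filter(beams, target, vocabulary):
    """Collect unique words across beam outputs, then keep only in-vocabulary ones.

    beams:      list of comma-separated substitute sequences (one per beam)
    target:     the target word, always removed from the cleaned set
    vocabulary: set of words allowed as substitutes for this target
    Returns (collected, clean), both in first-occurrence order.
    """
    seen = set()
    collected = []
    for sequence in beams:
        for word in sequence.split(","):
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                collected.append(word)
    clean = [w for w in collected if w in vocabulary and w != target]
    return collected, clean


# Toy usage (hypothetical beams and vocabulary).
beams = ["vivid, luminous, shining", "vivid, bright, brightly"]
vocabulary = {"vivid", "luminous", "shining", "brilliant", "bright"}
collected, clean = collect_and_filter(beams, "bright", vocabulary)
discarded = [w for w in collected if w not in clean]
```

Keeping the discarded words around is convenient because they can be reused later if a fallback is needed.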
Contextualization We denote with $j$ the index of the target word $w_t$ in context $x$, and the contextualized representation of $w_t$ in $x$ as:
$$v_{x,j} = NLM(x)_j \tag{2}$$
where $NLM(x)$ is the representation of $x$ obtained through an arbitrary neural language model. Then, for each valid substitute $w_c \in W_{clean}$, we obtain a modified context $x_{w_c}$ by replacing the target word $w_t$ with $w_c$ (Figure 1(c)). Now we can obtain a contextualized representation of each substitute as:
$$v_{x_{w_c},j} = NLM(x_{w_c})_j \tag{3}$$
Ranking To produce the final ranking of the substitutes (Figure 1(d)) we compute the cosine similarity of the target word vector with respect to that of each substitute, i.e., $cossim(v_{x,j}, v_{x_{w_c},j})\ \forall w_c \in W_{clean}$, and order the substitutes by their descending cosine similarity with the target.
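As a rough illustration of the ranking step, the sketch below orders substitutes by the cosine similarity between a target vector and each substitute vector. The vectors here are tiny hand-written placeholders, not real contextualized embeddings, and the names `cosine` and `rank_substitutes` are hypothetical; in the described system the vectors would come from a neural language model applied to the original and modified contexts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_substitutes(target_vec, substitute_vecs):
    """Return (substitute, similarity) pairs sorted by descending similarity."""
    scored = [(word, cosine(target_vec, vec)) for word, vec in substitute_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage with placeholder 2-dimensional vectors (assumed values).
target_vec = [1.0, 0.0]
substitute_vecs = {
    "shining": [0.1, 0.9],
    "vivid": [0.9, 0.1],
    "luminous": [0.5, 0.5],
}
ranking = rank_substitutes(target_vec, substitute_vecs)
```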

Vocabulary Definition
One of the challenges in the lexical substitution task is the lack of a predefined substitute inventory, i.e., for each target word we lack a reference list of possible replacements. Importantly, with GENESIS we can produce virtually any word in the English vocabulary as substitute, although standard evaluation benchmarks consider valid only the words in the test vocabulary. To reach a suitable trade-off between the generative power of the model and the necessity of a fair evaluation, we define an output vocabulary that the model has to stick to, i.e., we discard any generated word that is not contained in it (Section 3.2).
To build our vocabulary, we take advantage of WordNet 3.0, a widely-used lexicographic resource structured as a graph. Each WordNet node is a synset, i.e., a set of different lexicalizations with the same meaning and POS, while edges represent semantic relations between synsets, such as hyponymy and hypernymy. For example, one of the synsets for the adjective bright is {bright, brilliant, vivid}, which is connected through the similar-to semantic relation to the synset {colorful, colourful}. For each target $w_t$ we compute a set of synsets $D_{w_t}$ that defines the output vocabulary. We initialize $D_{w_t}$ as the set of synsets $S_{w_t}$ where $w_t$ appears. Then, for each $s_{w_t} \in S_{w_t}$ we expand $D_{w_t}$ by collecting all the neighbors $N(s_{w_t})$ of $s_{w_t}$; finally, for each neighbor $n$ that is connected to $s_{w_t}$ through a hyponymy, hypernymy, similar-to or see-also relation, we add all the neighbors of $n$ to $D_{w_t}$. We define as possible substitutes for $w_t$ the union of all the lexicalizations appearing in $D_{w_t}$. This procedure, visualized in Figure 2, builds a vocabulary that covers all the senses enumerated in WordNet for a given target, defining a reduced range of available substitutions that is still challenging for the task. To provide a quantification of the coverage of the output vocabulary, we report that, when computed for the LST targets, it includes 25,842 distinct substitute words, while the original test set has 3154 possible substitutes. Their intersection covers 2013 words.
Figure 2: A visual example of how the vocabulary is constructed. We consider $s_w$, one of the two synsets where the noun wine appears (purple oval). First, we consider all its neighbors (orange ovals), then for each neighbor $n$ of $s_w$ connected through hypernymy, hyponymy, similar-to and see-also relations (double orange oval), we include all the neighbors of $n$ (green ovals). The neighbors of synsets connected through different relations to $s_w$ are discarded (grey ovals).
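The expansion procedure can be sketched on a small hand-written graph. The code below is a minimal illustration, not the authors' implementation: the graph is a toy stand-in for WordNet (only the synsets {bright, brilliant, vivid} and {colorful, colourful} and their similar-to link come from the text; every other node, relation label and identifier is invented for the example), and `build_vocabulary` is a hypothetical name.

```python
# Relations whose endpoints are expanded one step further (labels are assumed names).
EXPANDING_RELATIONS = {"hypernym", "hyponym", "similar_to", "also_see"}


def build_vocabulary(target_synsets, edges, lemmas):
    """Build the output vocabulary for a target.

    target_synsets: synset ids in which the target appears (S_{w_t})
    edges:          dict synset id -> list of (relation, neighbor id)
    lemmas:         dict synset id -> list of lexicalizations
    """
    selected = set(target_synsets)                    # initialise D_{w_t}
    for synset in target_synsets:
        for relation, neighbor in edges.get(synset, []):
            selected.add(neighbor)                    # every direct neighbor
            if relation in EXPANDING_RELATIONS:       # second hop only via these relations
                for _, second in edges.get(neighbor, []):
                    selected.add(second)
    vocabulary = set()
    for synset in selected:
        vocabulary.update(lemmas.get(synset, []))
    return vocabulary


# Toy graph (hypothetical ids; antonym links and their lemmas are invented).
lemmas = {
    "bright.a": ["bright", "brilliant", "vivid"],
    "colorful.a": ["colorful", "colourful"],
    "colorless.a": ["colorless", "colourless"],
    "dull.a": ["dull"],
    "drab.a": ["drab"],
}
edges = {
    "bright.a": [("similar_to", "colorful.a"), ("antonym", "dull.a")],
    "colorful.a": [("similar_to", "bright.a"), ("antonym", "colorless.a")],
    "dull.a": [("similar_to", "drab.a")],
}
vocab = build_vocabulary(["bright.a"], edges, lemmas)
```

In this toy run "drab" is excluded because its only path from the target synset goes through a neighbor reached by a non-expanding relation. The target itself remains in the vocabulary here and is removed later, during filtering.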

Dataset Generation
GENESIS is able to generate substitutes starting from a word in its context. Thus, starting from a source dataset $C$ of target words in context, we can exploit GENESIS to produce ranked lists of substitutes and, associating the generated substitutes with the targets, obtain silver datasets for the lexical substitution task. To this end, first, we finetune GENESIS on a gold dataset for the lexical substitution task; then, we give as input to the finetuned model the corpus $C$, generating as output a list of replacements for each input instance. The input instances, associated with the generated substitutes, constitute the silver corpus. To ensure the quality of the generated substitutes, we apply a similarity threshold λ on the ranking step of GENESIS (cf. Section 3), keeping only the substitutes whose similarity to the target is higher than λ. As source dataset $C$ we exploit SemCor (Miller et al., 1993), a manually annotated corpus where instances are sense-tagged according to the WordNet sense inventory. While it is typically used as a training corpus for English Word Sense Disambiguation (WSD), as we show, its manually-curated sense distribution is also beneficial for lexical substitution. Indeed, having a frequency of target words that

Experimental Setup
In this section, we specify the setting used to tackle the lexical substitution task.
Model We use BART (Lewis et al., 2020) as seq2seq model, trained through the RAdam optimiser (learning rate $10^{-5}$); we train it for a maximum of 100 epochs, with early stopping and patience set to 2 epochs. The input is fed to the model in batches of up to 600 tokens. To obtain the contextualized representations in Equations (2) and (3) we use the average of the last four hidden layers of BERT large cased. Both BART and BERT are used through the HuggingFace (Wolf et al., 2020) implementations.
Datasets We finetune BART on the concatenation of CoInCo and TWSI. Indeed, the former is originally distributed without training split, with only test and dev sets released; the latter, instead, contains only nouns, so it is not suitable for training alone. Thus, we concatenate the two datasets and produce new train and dev splits by randomly reserving 30% of the target contexts for the dev set $CT_D$ and the remaining 70% as training split $CT_T$. As test set we use LST, the dataset originally released for the SemEval-2007 task. As regards the generated datasets, we denote with GENSEMCOR$_n$ the dataset obtained by generating substitutes for a sample of $n$ contexts randomly drawn from SemCor. Starting from the size of $CT_T$, i.e., 37k sentences, we generate four different samples by doubling the dataset size each time, with each sample including all the sentences in the previous one, i.e., GENSEMCOR$_{37k}$ ⊂ GENSEMCOR$_{74k}$ and so on. The final dataset, which includes all the SemCor sentences for which at least one substitute has been generated, is identified by GENSEMCOR. We highlight that, when training on GENSEMCOR datasets, we use only silver data, without concatenating gold corpora. The properties of the gold and generated datasets are summarised in Table 1.

Evaluation Metrics
We evaluate the performance of our model using the metrics originally proposed for the task (McCarthy and Navigli, 2009), i.e., best and out-of-ten (oot), together with their mode variations. The best metric allows a system to produce as many substitutes as are considered useful, by dividing the credit for each correct guess by the number of produced guesses. The best substitute should ideally be provided first. For each test instance, we provide the scorer only with the first substitute from the ranking detailed in Section 3.2. The oot metric evaluates up to ten candidates that all have the same relevance for the target, without dividing the credit of each correct guess by the number of produced guesses. In this case, we provide the scorer with the first ten substitutes as ranked by GENESIS. The mode variations of the best and oot metrics evaluate only the subset of the test set where a mode exists, i.e., where a majority of the annotators selected a single substitute as the best replacement. The formalization of the metrics employed is detailed in the supplementary materials, Section A. In addition to the standard metrics, we follow Arefyev et al. (2019) and also report p@1, p@3 and r@10.
Fallback Strategy When evaluating a system on the oot metrics, there is no advantage in providing fewer than ten substitutes. For this reason, whenever the procedure described in Section 3 results in fewer than ten substitutes, we apply a two-stage fallback strategy. First, we include in the substitutes all the generated words that were discarded when cutting on the output vocabulary. Second, if the list still has fewer than ten candidates, we extract from the vocabulary the substitutes that were not produced by the model, rank them according to their cosine similarity with the target (cf. Section 3.2) and extend the sequence produced until it reaches ten substitutes.
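The two-stage fallback can be sketched as follows. This is a hedged illustration: the function name `apply_fallback` and its arguments are hypothetical, the vocabulary candidates are assumed to arrive already sorted by cosine similarity with the target, and truncating to ten items after the first stage is an assumption of the sketch rather than something stated in the text.

```python
def apply_fallback(ranked_clean, discarded_generated, vocabulary_ranked, size=10):
    """Extend a ranked substitute list to `size` items with a two-stage fallback.

    ranked_clean:        in-vocabulary substitutes, already ranked
    discarded_generated: generated words removed by the vocabulary cut
    vocabulary_ranked:   vocabulary words sorted by similarity to the target
    """
    if len(ranked_clean) >= size:
        return list(ranked_clean)          # fallback not needed
    result = list(ranked_clean)
    # Stage 1: re-admit generated words that the vocabulary cut had discarded.
    for word in discarded_generated:
        if word not in result:
            result.append(word)
    # Stage 2: fill the remaining slots with vocabulary words the model did not produce.
    for word in vocabulary_ranked:
        if len(result) >= size:
            break
        if word not in result:
            result.append(word)
    return result[:size]


# Toy usage with a smaller list size for brevity (all words are placeholders).
filled = apply_fallback(
    ranked_clean=["vivid", "luminous"],
    discarded_generated=["brightly"],
    vocabulary_ranked=["brilliant", "vivid", "shining", "colorful"],
    size=5,
)
```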

Parameter Selection
GENESIS has several features that can be customized, from the model configuration to the generation parameters. With the aim of obtaining the best-performing setting for the lexical substitution task, we conduct an extensive tuning of the GENESIS configuration, testing how each parameter affects the results on the dev dataset ($CT_D$). Here we briefly describe the best-performing settings, while we report the results for each variation of the parameters in the supplementary material (Sections B, C).

Model Parameters
We experiment with different values of dropout, encoder layer dropout and decoder layer dropout. We investigate how the variation of each parameter influences the performances by exploring values in the range [0, 0.6].
When training on the $CT_T$ dataset, the best performing setup uses dropout = 0.5, encoder layer dropout = 0.2 and decoder layer dropout = 0.6. This setting is used to perform all the experiments on the $CT_T$ dataset and to generate the datasets from SemCor. Then, a new selection of parameters is made on the GENSEMCOR$_{37k}$ dataset, resulting in a new configuration with dropout = 0.1, encoder layer dropout = 0.6 and decoder layer dropout = 0.2. This configuration is used for the experiments on the GENSEMCOR datasets.
Generation Parameters Several decoding strategies are available for seq2seq models. We experiment with beam sizes and check whether the use of sampling is beneficial for the task. The optimal configuration has beam size 50 and no sampling.

Dataset Generation Parameters
The generated substitutes are filtered through the similarity threshold λ. We tune it experimenting with the values in [0.5, 0.7, 0.8], with the best performing dataset obtained with 0.7.
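A minimal sketch of how the similarity threshold could be applied when assembling silver instances is shown below. The function names, the dictionary layout of an instance and the example scores are all assumptions made for illustration; only the strict "higher than λ" comparison, the value 0.7 and the rule that instances without any surviving substitute are left out follow the description in the text.

```python
def filter_by_threshold(scored, lam=0.7):
    """Keep only (substitute, similarity) pairs whose similarity is strictly above lam."""
    return [(word, score) for word, score in scored if score > lam]


def build_silver_instance(context, target, scored, lam=0.7):
    """Return a silver instance, or None when no substitute survives the threshold."""
    kept = filter_by_threshold(scored, lam)
    if not kept:
        return None
    return {
        "context": context,
        "target": target,
        "substitutes": [word for word, _ in kept],
    }


# Toy usage with made-up similarity scores.
scored = [("vivid", 0.91), ("luminous", 0.74), ("shining", 0.70), ("clear", 0.42)]
instance = build_silver_instance("The roses are bright.", "bright", scored)
```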

Experiments
Once all the parameters have been tuned, we train GENESIS on $CT_T$, cut the substitutes generated according to the output vocabulary, apply the fallback strategy and test on the LST dataset.
Baseline Using a predefined vocabulary limits the possible outputs of our model; therefore, to assess whether the performances of GENESIS are mainly influenced by the restricted vocabulary, we devise a baseline that for each target word ranks by cosine similarity all the substitutes contained in the vocabulary built for the target, deploying the contextualization detailed in Section 3.
Table 2: Results on the lexical substitution task of GENESIS trained on the $CT_T$ dataset (third block) and on GENSEMCOR (fourth block). We compare GENESIS with the two latest approaches to the task (first block) and to a baseline (second block). - and - indicate that the output vocabulary cut and fallback strategy are discarded, respectively. For all the metrics, the higher the better.

Comparison Systems
We choose as comparison systems the two most recent approaches to the task, i.e., the BERT-based system proposed by Zhou et al. (2019) and the best-performing solution presented by Arefyev et al. (2020), i.e., an XLNet-based model enhanced with the injection of specific embedding information about the target word. These two models achieve the currently highest reported results on the task. In these approaches the language models are used as feature extractors, i.e., they are not finetuned for the task. As already noted by Arefyev et al. (2020), both models output a probability distribution over a BPE-based vocabulary, making it tricky to reconstruct words at inference time. GENESIS, instead, overcomes this limit by relying on the decoding strategy of the generative model.

Results
We report our results in Table 2. The baseline (second block) performs poorly when it comes to predicting the most appropriate substitute (best, p@1, p@3), while it is quite strong in evaluating the top ten substitutes (oot, r@10). This is somewhat to be expected: the average number of substitutes in the test set is four (cf. Table 1), hence there is a good chance that the ten substitutes in WordNet that are closest to the target include the gold ones. As regards the results obtained with GENESIS, in each configuration we report the average of five runs with their standard deviation.
First, we inspect the performances of GENESIS without output vocabulary and without fallback strategy (GENESIS -). The generative approach alone is noisy, showing performances that are lower than the state of the art on every metric. Indeed, when adding the cut on the output vocabulary (GENESIS -), the scores increase on best, best-mode, p@1 and p@3, reaching performances that are higher than the previous state of the art on best and p@3. At the same time, though, reducing the substitutes to the output vocabulary leads to the production of fewer than ten substitutes, thus decreasing the recall scores, as shown by the drop on r@10 and oot. This is further confirmed by the improvement on the oot and r@10 metrics given by the use of the fallback strategy (GENESIS), i.e., when using the complete system and always providing ten substitutes. In the fourth block of the table we present the results obtained when finetuning BART over the generated datasets. We start with GENSEMCOR$_{37k}$, which is comparable in size with $CT_T$. In this case, the silver dataset performs better than the gold one, with results that are better than the state of the art on p@3, best and best-mode, besides being considerably more stable, as shown by the reduced variance across the metrics. We believe this improved behavior is due to the wider variety of targets (and consequently substitutes) to be found in the generated datasets (cf. Table 1), which helps the model to generalize more effectively. Increasing the size of the sample considered to 148k helps improve the results, achieving state-of-the-art performances on five metrics out of seven. With 148k sentences the system seems to reach a stable point, and adding further sentences does not bring any additional useful information to the model.

Qualitative Evaluation
Our quantitative analysis shows that the substitutes produced by GENESIS are good enough to outperform the previous approaches to the task. The evaluation setting, though, is inherently limited to a fixed vocabulary (see footnote 7). Hence, we devise an annotation task to assess the quality of generated substitutes, investigating whether GENESIS is able to generate substitutes that, even when not appearing in the gold standard, are judged good replacements by human annotators.

Annotation Task
We set up a test where an annotator is provided with a target word in context and a set of substitutes that are equally distributed among the gold and the generated ones. The annotator is required to select, if there are any, all the substitutes that are not suitable replacements for the target in the given context. We select three annotators with certified proficiency in English and previous experience in linguistic annotation tasks and present them with a sample of 322 test instances drawn from the LST dataset (see footnote 8). The annotators are asked to select the inappropriate substitutes from an anonymized shuffled set of three gold and three generated substitutes, obtained with GENESIS trained on $CT_T$. For all the instances, the gold substitutes do not appear in the generated ones and vice versa. The annotation guidelines are reported in the supplementary material (Section D).

Inter-Annotator Agreement
Since each annotator may select more than one substitute, we measure the inter-annotator agreement (IAA) using Kraemer's κ coefficient (Kraemer, 1980), an extension of the better known Cohen's κ (Cohen, 1960) that allows multiple answers to be provided by annotators. We follow Landis and Koch (1977) to interpret κ values in the range (0.4, 0.6] as moderate agreement, values in (0.6, 0.8] as substantial agreement and those in (0.8, 1.0] as almost perfect agreement. Annotations are usually considered reliable if their IAA is equal to or higher than the threshold indicated by Eugenio and Glass (2004).
7 We recall from Section 4 that the vocabulary of LST has slightly more than 3000 words.
8 The sample size is significant with respect to the source dataset with a confidence level of 95% and a margin of error of ±5%. The annotation interface was developed through Label Studio (https://github.com/heartexlabs/label-studio#try-out-label-studio).
Results As expected, the percentage of bad substitutes is higher in the generated dataset than in the manually produced gold, with 21% of the generated replacements considered inappropriate, versus the 13% discarded from the gold dataset, with an inter-annotator agreement of 0.71. The high percentage of accepted substitutes among the generated ones reflects the good quality of the replacements provided by GENESIS, confirming the validity of the approach. The results on the gold set, instead, raise some questions on its completeness and on the validity of an evaluation setting entirely dependent on such a restricted vocabulary. Indeed, more than 10% of gold substitutes are considered inappropriate by the annotators, and 40% of the discarded substitutes are gold. Moreover, we recall that, for each instance given to the annotators, the gold and the generated substitutes are disjoint, meaning that all the generated substitutes accepted by the annotators (79% of the ones proposed) are missing from the gold but still considered suitable replacements.
To give a deeper insight into the incompleteness of the gold dataset, in Table 3 we provide an example of substitutes generated by GENESIS compared with the gold ones. GENESIS is able to provide a richer variety of appropriate substitutes compared to the gold, which lacks several valid substitutes. GENESIS shows its effectiveness in particular when the target is an adjective (e.g. tremendous, bright); with nouns and verbs, it still manages to provide additional good substitutes in comparison to the gold (e.g. rest, skip), but it shows shorter generations, leading to fewer substitutes. On the adverbs, instead, it sometimes fails to capture the semantics of the target, producing replacements that are not appropriate for the context (e.g. late) or that do not fit syntactically in the sentence (e.g. earlier).

Conclusions
In this paper we presented GENESIS, the first generative approach to lexical substitution. The method is simple but versatile: by finetuning a seq2seq model and post-processing its output we are able to generate appropriate substitutes for target words in contexts. Testing GENESIS on the lexical substitution task, we show performances that surpass the state of the art on several measures. At the same time, our approach can be used to produce large-scale silver data, which, when used as training corpora for GENESIS, outperform the state of the art on five out of seven metrics. Moreover, large-scale datasets could possibly be deployed to finetune lighter models for the task. Finally, we conduct an annotation task to evaluate the quality of generated substitutes, which results in recognizing 79% of the proposed replacements as good substitutes and also highlights some weaknesses of the current evaluation setting, in that it is strictly bound to an incomplete output vocabulary. As future work, we plan to extend GENESIS to other languages for which the lexical substitution task has been proposed, such as Italian (Toral, 2009) and German (Miller et al., 2015). Moreover, we will investigate how the substitutes produced can be deployed in lexical-semantic tasks such as WSD (Bevilacqua et al., 2021b) or Lexical Simplification (Paetzold and Specia, 2017).

Context | Gold Substitutes | Generated Substitutes
I think this idea has tremendous promise. | |
If you wish to collect your robes earlier you should contact the above number to arrange collection. | beforehand, sooner, prior to that, by then, before | previously, before
He was bright and independent and proud. | intelligent, clever | sharp, enthusiastic, intelligent, talented
Let your child pick one bug to glue on the lid. | insect | fly, insectoid, critter, insect, creature
Table 3: An excerpt of the GENESIS output for LST sentences when training on GENSEMCOR$_{74k}$, compared with gold substitutes. The generated substitutes reported do not include those added through the fallback strategy.

A Metrics
We measured the performances on the task with the standard metrics for lexical substitution, i.e., best and oot.
Preliminaries We define $T$ as the set of test instances and $H$ as the set of annotators for the test set. Then, $A$ is the set of instances in $T$ for which the system provides at least one substitute. For the item $i \in A$ we identify the set of substitutes provided by the system as $a_i$, while $h_i$ represents the set of responses for the item $i$ provided by annotator $h \in H$. Finally, for each $i$ we compute the multiset union $H_i$ of the $h_i$ of all annotators; each unique type $res$ in $H_i$ has an associated frequency $freq_{res}$, i.e., the number of times it appears in $H_i$. For example, let us consider an item for happy and assume the annotators had supplied answers as follows:

| annotator id | responses |
|---|---|
| 1 | glad, merry |
| 2 | glad |
| 3 | cheerful, glad |
| 4 | merry |
| 5 | jovial |

Then $H_i$ would be [glad glad glad merry merry cheerful jovial]. The $res$ with associated frequencies would be glad 3, merry 2, cheerful 1, jovial 1.
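The multiset union described above can be sketched in a few lines of Python. This is only an illustration: the function name `multiset_union` and the dictionary layout (annotator id mapped to a list of responses) are assumptions made for this example, not part of the released code.

```python
from collections import Counter


def multiset_union(responses_by_annotator):
    """Merge the per-annotator responses h_i into one frequency table H_i."""
    h_i = Counter()
    for responses in responses_by_annotator.values():
        h_i.update(responses)  # each response adds 1 to its frequency
    return h_i


# The "happy" example: annotator id -> responses (hypothetical layout).
happy = {
    1: ["glad", "merry"],
    2: ["glad"],
    3: ["cheerful", "glad"],
    4: ["merry"],
    5: ["jovial"],
}

freq = multiset_union(happy)
for res, count in freq.most_common():
    print(res, count)
```

The resulting counter holds seven responses in total, with glad appearing three times and merry twice, matching the frequencies listed in the example.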
As regards the mode variations, we define as the mode $m_i$ the most frequent response for instance $i \in T$, if it exists. The sets where this mode exists are $TM$ and $AM$, respectively, for the gold substitutes and the system ones.
best and best-mode Defining $P_b$ and $R_b$ as best precision and best recall respectively, we formulate best as

[Equation 4]

where

[definitions of $P_b$ and $R_b$]

As regards the mode variation, we modify precision and recall as

[mode definitions of $P_b$ and $R_b$]

respectively, where $bg$ is the best guess in the list of substitutes provided by the system. Then, best-mode is computed as in Equation 4.

oot and oot-mode In this case, we define $P_o$ and $R_o$ as oot precision and oot recall, respectively. Then we compute oot as

[Equation 9]

where

[definitions of $P_o$ and $R_o$]

while in the mode variation precision and recall are slightly modified as

[mode definitions of $P_o$ and $R_o$]

respectively. Once again, oot-mode can be computed by following Equation 9.
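As an illustration of how these quantities are usually computed, the sketch below follows the precision and recall definitions of the original SemEval-2007 lexical substitution task (McCarthy and Navigli), on the assumption that these are the standard metrics referred to here. It covers only precision and recall, not how they are combined into a single score. All function names are hypothetical, system outputs are assumed to be ranked lists without duplicates, and oot is assumed to consider up to ten guesses.

```python
from collections import Counter


def item_credit(guesses, gold_freq, divide_by_guesses):
    """Credit for one item: summed gold frequency of the guesses over |H_i|."""
    hits = sum(gold_freq.get(g, 0) for g in guesses)
    if divide_by_guesses:  # best also divides by the number of guesses |a_i|
        hits /= len(guesses)
    return hits / sum(gold_freq.values())


def precision_recall(system, gold, divide_by_guesses, max_guesses=None):
    """system: item -> ranked guesses; gold: item -> Counter of responses."""
    answered = [i for i in gold if system.get(i)]  # the set A
    total = sum(
        item_credit(system[i][:max_guesses], gold[i], divide_by_guesses)
        for i in answered
    )
    precision = total / len(answered) if answered else 0.0
    recall = total / len(gold) if gold else 0.0
    return precision, recall


def unique_mode(gold_freq):
    """Most frequent response, or None when it is tied or absent."""
    top = gold_freq.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0] if top else None


def mode_precision_recall(system, gold, max_guesses):
    """Fraction of items whose mode is found among the first guesses."""
    modes = {i: unique_mode(freq) for i, freq in gold.items()}
    tm = [i for i in modes if modes[i] is not None]  # items with a mode (TM)
    am = [i for i in tm if system.get(i)]            # answered ones (AM)
    correct = sum(modes[i] in system[i][:max_guesses] for i in am)
    precision = correct / len(am) if am else 0.0
    recall = correct / len(tm) if tm else 0.0
    return precision, recall


# Toy data: the "happy" item, with a hypothetical system output.
gold = {"happy": Counter({"glad": 3, "merry": 2, "cheerful": 1, "jovial": 1})}
system = {"happy": ["glad", "cheerful"]}

best_p, best_r = precision_recall(system, gold, divide_by_guesses=True)
oot_p, oot_r = precision_recall(system, gold, divide_by_guesses=False,
                                max_guesses=10)
best_mode_p, best_mode_r = mode_precision_recall(system, gold, max_guesses=1)
oot_mode_p, oot_mode_r = mode_precision_recall(system, gold, max_guesses=10)
```

On this toy item the two guesses collect a gold frequency of 4 out of 7 responses, so oot gives 4/7 while best, which divides by the two guesses, gives 2/7. Both mode variants score 1 because the first guess is the mode glad.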

B Parameter Selection
As regards the model parameters, we explored how dropout, encoder layer dropout and decoder layer dropout affect the performances of the model. We found two different groups of better-performing parameters when training on $CT_T$ and on GENSEMCOR$_{37k}$. In both cases, the variation of the results was measured on the $CT_D$ set. To keep the number of experiments feasible, we set one parameter at a time: the dropout first, then the encoder layer dropout and finally the decoder layer dropout. Figures 4, 5 and 6 report the results of each tuning experiment for the models trained on $CT_T$, while Figures 7, 8 and 9 report the performances when training on GENSEMCOR$_{37k}$. Often there is no single value for which the model performs uniformly better on all the metrics. In these cases, we tried to select the values that maximised more than one metric, or those that provided a higher improvement.
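The one-parameter-at-a-time procedure can be sketched as a greedy search, shown below under simplifying assumptions. The parameter names, the candidate values and the `toy_dev_score` function are all hypothetical. In particular, the sketch collapses the dev-set evaluation into a single scalar, whereas the selection described above weighs several metrics.

```python
def tune_one_at_a_time(search_space, defaults, evaluate):
    """Fix each parameter in turn, keeping the best value found so far."""
    config = dict(defaults)
    for name, candidates in search_space.items():  # dict order = tuning order
        best_value, best_score = config[name], float("-inf")
        for value in candidates:
            score = evaluate({**config, name: value})
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value  # frozen before tuning the next parameter
    return config


# Hypothetical search space, tuned in the order described in the text.
search_space = {
    "dropout": [0.0, 0.1, 0.2, 0.3],
    "encoder_layerdrop": [0.0, 0.1, 0.2, 0.3],
    "decoder_layerdrop": [0.0, 0.1, 0.2, 0.3],
}
defaults = {"dropout": 0.0, "encoder_layerdrop": 0.0, "decoder_layerdrop": 0.0}


def toy_dev_score(config):
    """Stand-in for a dev-set evaluation; peaks at an arbitrary setting."""
    return -(
        (config["dropout"] - 0.1) ** 2
        + (config["encoder_layerdrop"] - 0.2) ** 2
        + (config["decoder_layerdrop"] - 0.0) ** 2
    )


selected = tune_one_at_a_time(search_space, defaults, toy_dev_score)
print(selected)
```

With four candidate values per parameter, this search needs 12 evaluations instead of the 64 required by a full grid, at the cost of ignoring interactions between parameters.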

C Generation Parameters
We compared the results obtained on $CT_D$ after finetuning GENESIS on $CT_T$, without any kind of filtering on the output vocabulary and without the fallback strategy, in order to evaluate how each decoding parameter directly affects the generation quality. As evaluation metrics, we considered only the metrics that affect the whole dataset, i.e., we excluded the mode variations from our analysis. As generation parameters we experimented with beam size and top-k sampling. We compared the results for $k = 5, 10$ with those obtained without sampling, i.e., always picking the most probable candidate. In all three cases, we used beam size 5 and postprocessed the generations of each beam as described in Section 3. We can see in Figure 10 that sampling increases the variety of the generated sequences, and consequently the precision scores, while sticking to the most probable candidate results in higher recall and oot scores. Considering that no configuration achieves the best results on all the metrics, and given the increase in memory and time requirements to keep track of the $k$ highest-ranked words, we decided not to use sampling at decoding time. As regards beam size, we compared the results obtained when using 5, 15, 25 or 50 beams, shown in Figure 11. As expected, generating more sequences results in a higher variety of generated words, thus leading to higher oot and r@10. At the same time, a broader generation may imply "dirtier" substitutes, with words that are close to the target but are not appropriate replacements, slightly decreasing best and p@k scores.
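To make the difference between the two decoding choices concrete, the sketch below contrasts always picking the most probable candidate with top-k sampling for a single decoding step. It is a self-contained toy: the function `pick_next` and the probability values are invented for illustration and do not reproduce the decoding code of the model.

```python
import random


def pick_next(probs, k=None, rng=random):
    """Pick the next token: greedy if k is None, else sample from the top-k."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if k is None:
        return ranked[0][0]  # always the most probable candidate
    tokens, weights = zip(*ranked[:k])  # keep only the k highest-ranked tokens
    # random.choices accepts unnormalised weights, so no renormalisation needed
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented output distribution for one decoding step.
step_probs = {
    "intelligent": 0.40,
    "clever": 0.25,
    "smart": 0.20,
    "luminous": 0.10,
    "clear": 0.05,
}

rng = random.Random(0)  # seeded only to make the sketch repeatable
greedy = pick_next(step_probs)
sampled = [pick_next(step_probs, k=3, rng=rng) for _ in range(20)]
print(greedy, sorted(set(sampled)))
```

The greedy choice is always the same token, while the sampled runs can return any of the three highest-ranked tokens, which is the source of the extra variety (and of the extra bookkeeping) mentioned above.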

D Annotators Guidelines
For the annotation task, we provided each annotator with a set of instances comprising a context with a single target word (in bold) and six possible substitutes.
The annotator had to select all the inappropriate substitutes, sticking to the following guidelines:
1. A substitute is wrong if it is an inflection of the target (1).
2. A substitute is wrong if its replacement modifies the meaning of the sentence (2, 6).
3. A substitute is wrong if its replacement in the context results in a wrong structure of the sentence (6).
4. A substitute is wrong if it has an inflected form that is different from that of the target (5).
5. A substitute is correct if it is in its base form and not in the same inflection as that of the target (3, 4).

Figure 10: The results on the dev set when not using sampling (left) and when using it with top-5 (center) and top-10 (right) most probable elements in the output distribution.