BOS at LSCDiscovery: Lexical Substitution for Interpretable Lexical Semantic Change Detection

We propose a solution for the LSCDiscovery shared task on Lexical Semantic Change Detection in Spanish. Our approach is based on generating lexical substitutes that describe the old and new senses of a given word, and it achieves the second-best result in the sense loss and sense gain detection subtasks. By inspecting those substitutes that are specific to only one time period, one can understand which senses were gained or lost. This provides the user with more detailed information about the semantic change and makes our method interpretable.


Introduction
LSCDiscovery is a shared task on Lexical Semantic Change Detection (LSCD) in Spanish (Zamora-Reina et al., 2022). The participants were provided with two Spanish corpora, covering 1810-1906 and 1994-2020 respectively, and were asked to solve two subtasks. In the first subtask, the participants had to rank a given list of about 4K words according to the degree of their semantic change. The second subtask required determining, for each given word, whether its senses occurring in the two corpora differ (and optionally, whether it has acquired new senses and whether it has lost old ones).

Background
Our approach is based on the bag-of-substitutes (BOS) representation of word meaning in context (Başkaya et al., 2013;Arefyev and Zhikov, 2020). Lexical substitutes are those words that can replace a given target word in a given text fragment without making this fragment ungrammatical or substantially changing the meaning of the target word. For ambiguous words, lexical substitutes depend on their meaning expressed in a particular context. For instance, some reasonable substitutes for the word fly in the sentence A noisy fly sat on my shoulder are bug, beetle, butterfly, firefly, insect, etc. But in the sentence We will fly to London they are different: walk, run, bike, etc.
In order to generate lexical substitutes, we employ the XLM-R 1 masked language model (Conneau et al., 2020). This model was pre-trained on 2.5TB of data in 100 languages as a masked language model, i.e., it received text fragments with some tokens hidden (replaced with the special <mask> token) and was trained to guess those hidden tokens from their context. This kind of pre-training is partially aligned with the lexical substitution task because the model can predict words compatible with the given context. However, there is no guarantee that these words are similar or related in meaning to the target word. The suitable types of lexical substitutes (e.g., synonyms, hypernyms, co-hyponyms) and the suitable degree of their similarity to the target word depend on the target task and can be controlled with various techniques explored in previous work. In our solution, we employ the dynamic patterns proposed by Amrami and Goldberg (2018) and explained in Section 3.2.
Unlike the traditional bag-of-words representation, which contains the words occurring in a text fragment, the BOS representation is built from lexical substitutes. Thus, it better represents the meaning of a specific target word in a given text fragment rather than the fragment as a whole. Clustering of BOS vectors is a successful approach to the Word Sense Induction (WSI) task, i.e., discovering the senses of ambiguous words. This approach was explored in many papers, including (Başkaya et al., 2013; Amrami and Goldberg, 2018, 2019; Arefyev et al., 2019), among others. A substitution-based WSI model was also employed to solve the LSCD task in (Arefyev and Zhikov, 2020; Arefyev and Bykov, 2021). However, in our solution we avoid solving the more general and probably more difficult WSI task that requires clustering. Instead, we propose methods to directly obtain LSCD predictions from the BOS vectors. Following (Giulianelli et al., 2020; Laicher et al., 2021), we denote the average of pairwise distances between old and new examples as the Average Pairwise Distance (APD); note, however, that our vector representation is very different from those works.

Table 1: For the word actas (reports) in ayer recibimos dos actas literales (yesterday we received two verbatim reports), the 5 most probable substitutes with 1 or 2 subwords are shown, for the patterns with y (and) and incluso (including).

Model description
For the second subtask, if APD is greater than a certain threshold, we predict that this word has changed its meaning. To determine whether it has acquired new senses and whether any old senses were lost, we propose three different methods based on pairwise distances.

Collected data
For each target word w_i, we lemmatize 2 both corpora and retrieve all examples containing w_i in any of its grammatical forms. Then we take the same number N_i of examples from each corpus.
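This collection step can be sketched as follows; the function name and the pre-lemmatized corpus format (pairs of lemma lists and original sentences) are our own illustration, not the paper's code:

```python
import random

def collect_examples(target_lemma, old_corpus, new_corpus, seed=0):
    """Retrieve all sentences containing the target lemma in any grammatical
    form, then subsample both sides down to the same size N_i.

    Each corpus is a list of (lemmas, sentence) pairs; lemmatization itself
    is assumed to have been done upstream."""
    old = [s for lemmas, s in old_corpus if target_lemma in lemmas]
    new = [s for lemmas, s in new_corpus if target_lemma in lemmas]
    n_i = min(len(old), len(new))      # same number of examples per corpus
    rng = random.Random(seed)
    return rng.sample(old, n_i), rng.sample(new, n_i)
```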

Substitute generation
For each example, we generate several types of substitutes with different dynamic patterns, post-process them, and combine them together to get a single vector representation.

Table 2: In LS_m1_7, we employ 7 single-subword patterns with y (and), incluso (including) and por ejemplo (for example) with the specified weights.

Dynamic patterns are similar in nature to Hearst patterns (Hearst, 1992). They were proposed in (Amrami and Goldberg, 2018) to obtain from masked language models those substitutes that not only fit the given context, but are also similar or related to the target word in meaning. For instance, using patterns with the Spanish conjunction y (English: and) we hope to obtain mostly co-hyponyms of the target word, while patterns with the adverb incluso (English: including) should bias the model towards generating hypernyms or hyponyms, depending on the position of the target word. Table 1 shows some examples. Table 2 lists all dynamic patterns we use. All patterns contain the special token <mask> that XLM-R is asked to recover, and some of them contain the variable T representing the target word. Given a pattern and an example for some target word, we first replace the target word with the pattern, and then replace the variable T (if any) back with the target word. For simplicity, consider an example in English. Given the sentence We can fly to London and the pattern <mask> (and T), we first obtain We can <mask> (and T) to London, and finally We can <mask> (and fly) to London.
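The two-step pattern substitution can be sketched as plain string manipulation (a simplification of whatever the actual implementation does: proper tokenization and the handling of a bare T elsewhere in the sentence would need more care):

```python
def apply_pattern(sentence, target, pattern):
    """Insert a dynamic pattern in place of the target word, then substitute
    the variable T back with the target word itself.

    `pattern` uses <mask> for the token(s) XLM-R must recover and T for the
    target-word slot, e.g. "<mask> (y T)"."""
    with_pattern = sentence.replace(target, pattern, 1)  # step 1: target -> pattern
    return with_pattern.replace("T", target)             # step 2: T -> target

# The English example from the text:
apply_pattern("We can fly to London", "fly", "<mask> (and T)")
# -> "We can <mask> (and fly) to London"
```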
The vocabulary of XLM-R consists of 250K subwords in 100 different languages; these are sometimes whole frequent words, but most often pieces of words. To better describe word meaning, we generate substitutes consisting of different numbers of subwords. To achieve this, we apply patterns with several <mask> tokens, for instance, <mask><mask> (y T).
To find probable sequences of subwords that could fill the <mask> tokens, we apply a slightly modified greedy decoding strategy. For the leftmost <mask> token, the topK = 150 most probable subwords are predicted first. Then, for each of those subwords, we generate one continuation using greedy decoding. Below we say that a substitute was not generated for a particular pattern in a particular example if it was not among the topK substitutes generated this way. For computational reasons, we generated only substitutes with one or two subwords and did not apply beam search for decoding. Examples of two-subword substitutes are shown in Table 1.
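The modified greedy decoding can be sketched as follows; the `predict` callable is a toy stand-in for XLM-R's per-mask distribution, and all names here are illustrative, not the paper's code:

```python
def generate_substitutes(predict, top_k=150, n_masks=2):
    """Modified greedy decoding for multi-subword substitutes.

    `predict(prefix)` stands in for the masked LM: given the subwords already
    filled in, it returns (subword, probability) pairs sorted by probability
    for the next <mask>.  For the leftmost <mask> we keep the top_k subwords;
    each of them is then extended with ordinary greedy decoding (one
    continuation per prefix, no beam search)."""
    substitutes = []
    for first, p_first in predict([])[:top_k]:
        prefix, prob = [first], p_first
        for _ in range(n_masks - 1):
            nxt, p_next = predict(prefix)[0]   # greedy: most probable continuation
            prefix.append(nxt)
            prob *= p_next
        substitutes.append(("".join(prefix), prob))
    return substitutes
```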

Substitute post-processing and combination
Next, we post-process all substitutes for each example: we convert them to lower case, remove all words except the last one from multi-word substitutes, and apply stemming. 4 After post-processing, we sum the probabilities of duplicated substitutes.
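A sketch of this post-processing step; the `stem` callable stands in for whichever Spanish stemmer is actually used, which the paper does not name here:

```python
def postprocess(substitutes, stem):
    """Lower-case each substitute, keep only the last word of multi-word
    substitutes, stem it, and sum the probabilities of the resulting
    duplicates.  `substitutes` is a list of (string, probability) pairs."""
    merged = {}
    for sub, prob in substitutes:
        word = stem(sub.lower().split()[-1])    # last word only, lower-cased
        merged[word] = merged.get(word, 0.0) + prob
    return merged
```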
For each example, we combine the substitutes generated for different patterns by calculating the weighted average of the corresponding probability distributions. In LS_m1 and LS_m2 (Lexical Substitution with one-subword and two-subword substitutes respectively), for combination we use the patterns and weights presented in Tables 2 and 3. The weights were selected based on a few experiments on the development set consisting of 20 words, so they are likely suboptimal. It is possible that a substitute is not generated by XLM-R for a certain pattern. In this case, during combination we assume that the corresponding probability equals the minimal probability among all substitutes generated for that pattern.

Table 4: Results for Phase 1 (Graded Change Discovery), including our original submissions and our post-evaluation results for LS_m1_7+APD.
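The weighted combination with the minimal-probability fallback can be sketched as follows (function and variable names are ours):

```python
def combine_patterns(distributions, weights):
    """Weighted average of per-pattern substitute distributions.

    `distributions` maps pattern -> {substitute: probability}.  If a pattern
    did not generate some substitute, its probability is taken to be the
    minimal probability among the substitutes that pattern did generate."""
    vocab = set().union(*distributions.values())
    total_w = sum(weights.values())
    combined = {}
    for sub in vocab:
        score = 0.0
        for pat, dist in distributions.items():
            p = dist.get(sub, min(dist.values()))  # fallback for missing substitutes
            score += weights[pat] * p
        combined[sub] = score / total_w
    return combined
```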

BOS vectors
For each target word w_i, we build 2N_i BOS vectors for the old and new examples. These vectors are essentially bag-of-words vectors built from the topK most probable substitutes for each example. Only substitutes that were generated for more than 3% and less than 90% of the examples of the target word are taken into account 5 .
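A sketch of the BOS construction with the document-frequency filter, written as a manual stand-in for scikit-learn's CountVectorizer with min_df/max_df (see footnote 5):

```python
import numpy as np

def build_bos_vectors(examples_substitutes, min_df=0.03, max_df=0.9):
    """Build bag-of-substitutes count vectors, keeping only substitutes
    generated for more than min_df and less than max_df of the examples."""
    n = len(examples_substitutes)
    df = {}                                   # document frequency per substitute
    for subs in examples_substitutes:
        for s in set(subs):
            df[s] = df.get(s, 0) + 1
    vocab = sorted(s for s, c in df.items() if min_df < c / n < max_df)
    index = {s: j for j, s in enumerate(vocab)}
    vectors = np.zeros((n, len(vocab)))
    for i, subs in enumerate(examples_substitutes):
        for s in subs:
            if s in index:                    # filtered substitutes are dropped
                vectors[i, index[s]] += 1
    return vectors, vocab
```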

Graded Change Discovery
APD (Average Pairwise Distance). After building the BOS vectors, we calculate the cosine distance from each old to each new example, resulting in a matrix of size N_i × N_i. The APD is calculated by averaging all cells of this matrix. Finally, we sort the test words according to their APDs and submit their ranks as the predicted change scores. 6
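A vectorized sketch of the APD computation (assuming the BOS vectors are non-zero, so the cosine normalization is defined):

```python
import numpy as np

def average_pairwise_distance(old_vecs, new_vecs):
    """APD: mean cosine distance between every old/new pair of BOS vectors."""
    old = old_vecs / np.linalg.norm(old_vecs, axis=1, keepdims=True)
    new = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    return float(np.mean(1.0 - old @ new.T))  # N_i x N_i cosine-distance matrix
```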

Binary Change Detection
For the main Binary Change Detection subtask, if the calculated APD is greater than a certain threshold 7 , we predict that the word has changed its meaning. In this case we also try to determine whether it has acquired new senses and whether it has lost some old ones (the sense gain and sense loss detection subtasks). We try three methods to determine that.

5 We used CountVectorizer from scikit-learn, where min_df = 0.03 was selected in the range from 0 to 0.05 with step 0.01 and max_df = 0.9 was selected in the range from 0.85 to 1 with step 0.01.
6 There was a mistake in the original implementation of the ranking procedure. After the competition we fixed it, which significantly improved the results of this method (see Table 4 for comparison).
7 threshold = 0.8 was selected on the development set in the range from 0.7 to 0.9 with step 0.05.

Table 5 (excerpt): CH,F1 / GAIN,F1 / LOSS,F1
baselines:
baseline1: 0.537 (9) / NaN (8) / NaN (6)
baseline2: 0.222 (10) / 0.211 (7) / 0.000 (6)
best results of other teams:
myrachins: 0.716 (1) / 0.491 (3) / 0.688 (1)
dteodore: 0.709 (2) / 0.000 (8) / 0.000 (6)

AID (Average Inner Distance). We calculate the APD between only the new examples (AID_1) and between only the old examples (AID_2). If AID_1 > (AID_2 − b_1), we predict that a new sense appeared. If AID_2 > (AID_1 − b_2), we predict that an old sense was lost. 8 Thus, we assume that a difference in the average inner distances of the two sets of examples indicates a difference in the underlying sets of senses.

min. We calculate an N_i × N_i matrix of pairwise distances from old to new examples and assume that if some new sense appeared, then there exists a new example that is far from all old examples. Thus, if there is at least one new example whose minimal distance to the old examples is greater than some threshold 9 , we predict that a new sense appeared. Sense loss is determined symmetrically.
perc. (percentile). This is similar to the previous method, but we calculate the 5th percentile instead of the minimum, i.e. we allow at most 5% of examples from the old corpus to be closer to an example of the new sense from the new corpus than the specified threshold. We assume that this should make the model less sensitive to noisy examples and more stable.
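The min and perc decision rules can be sketched as follows on the old-to-new distance matrix; the AID rule additionally needs the inner-distance matrices and is omitted, and the threshold here is illustrative rather than the tuned value:

```python
import numpy as np

def detect_gain_loss(dist, threshold=0.5, method="min"):
    """Sense gain/loss decisions from the N_i x N_i old-to-new cosine-distance
    matrix `dist` (rows: old examples, columns: new examples)."""
    if method == "min":
        # a new sense exists if some new example is far from ALL old examples
        gain = dist.min(axis=0).max() > threshold
        loss = dist.min(axis=1).max() > threshold   # symmetric for old senses
    elif method == "perc":
        # tolerate up to 5% of noisy close examples on the other side
        gain = np.percentile(dist, 5, axis=0).max() > threshold
        loss = np.percentile(dist, 5, axis=1).max() > threshold
    else:
        raise ValueError(method)
    return gain, loss
```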

Phase 1: Graded Change Discovery
In this subtask, it was required to rank about 4K target words according to their degree of semantic change (the higher the rank, the stronger the change). The final ranking quality was evaluated on 60 hidden words only, using Spearman's correlation with the gold ranks (Bolboaca and Jäntschi, 2006). Table 4 provides the results for the first phase. Our original implementation of the ranking procedure contained a mistake, so those results are poor. After the competition, we fixed the mistake and obtained corrected results, which are comparable to those of the 3rd best participant in the leaderboard.
LS_m1_2 and LS_m2_2 differ only in the number of masks in the patterns used. Comparing their scores, we can say that two-subword substitutes are preferable to one-subword substitutes. In LS_m2_7, seven patterns are combined, compared to two patterns in LS_m1_2; this gives a significant improvement despite the somewhat arbitrarily selected weights. Developing principled ways of finding promising dynamic patterns and weights for their combination is a reasonable direction for future work. LS_m1_7 has a slightly higher JSD,SPR score, but its COMPARE,SPR score is lower and it uses a more complex pattern combination than LS_m2_2. A more detailed investigation is presented in Appendix A.

Phase 2: Binary Change Detection
In this subtask, the participants were asked to determine whether target words have changed their meanings and, if so, how exactly (whether they have acquired and/or lost senses). Three F1 scores are calculated: Binary Change Detection (CH), Sense Gain Detection (GAIN), and Sense Loss Detection (LOSS). The results are presented in Table 5; we have the 2nd best submission for the optional GAIN and LOSS subtasks.
LS_m1_2 + APD and LS_m2_2 + APD achieve CH,F1 scores of 0.628 and 0.636 respectively, which suggests that using two-subword substitutes is slightly better than one-subword substitutes. With LS_m1_7 + APD we obtain 0.658 CH,F1, resulting in the 4th rank.
The AID method results in good GAIN,F1 and LOSS,F1 scores (Table 6). The min and percentile methods show better results, but they depend heavily on the LS patterns used: in some cases these methods improve only the GAIN,F1 or only the LOSS,F1 score, but not both.

Discriminative substitutes
The main advantage of LS-based models is their interpretability. We can roughly understand word meanings by looking at the discriminative substitutes, i.e., the substitutes specific to a particular subset of examples.

Table 7: Discriminative substitutes generated for the <mask> (y T) pattern. The probabilities P(w|M) and P(w|O) are shown for each substitute. Documentos is 'documents', señal is 'signal', memoria is 'memory' and canal is 'channel'.
From Table 7 we can see that disco (disc) and satélite (satellite) have acquired new senses, a data storage device and satellite television respectively.

Efficiency
The set of target words proposed in Phase 1 was intended to be a challenge for the participants due to its size. For the 4385 given words, we collected about 777K examples. Generating substitutes for all examples took 13 GPU-hours per one-mask pattern and 310 GPU-hours per two-mask pattern on V100 GPUs. All other steps took incomparably less time.

Conclusion
We have proposed an interpretable approach to lexical semantic change detection. It achieves the 2nd best result in the sense loss and sense gain detection subtasks and provides techniques to understand which senses a word has gained or lost.

A Substitute analysis
Our models depend mostly on the LS patterns used and on the way they are combined, so it is important to investigate them. In this section we study the following questions:
• Which single-subword pattern gives the best results, and how do the results depend on the number of substitutes generated (topK)?
• Is it better to use single-subword or multi-subword substitutes?
• Do brackets and dashes affect the results?

For brevity, we will use M instead of <mask> in the pattern descriptions.
In the following figures, mask position denotes the position of the <mask> token. For example, if the pattern is M (y T) / T (y M), mask position=left refers to the pattern M (y T), and mask position=right refers to T (y M). Finally, mask position=combination denotes the combination of these patterns with equal weights.

A.1 Single-subword patterns
In LS_m1_7 we use 7 patterns with different weights, which were selected after only a few experiments on the development set. In this section we study how the results depend on the patterns and try to find simpler and more intuitive ways of combining substitutes. Figures 1 and 2 show JSD,SPR and COMPARE,SPR for different patterns.
Interestingly, in all cases the left patterns give better results than the right ones, except for the incluso-based patterns. Also, in all cases the combination averages the results of both patterns, again except for the combination of incluso-based patterns, which, on the contrary, improves the results.

A.2 One-subword vs. two-subword substitutes

We assume that using more masks should improve the results, because it allows generating more diverse substitutes. Figure 3 compares patterns with different numbers of masks. As expected, the T (y MM) pattern gives much better results than T (y M). However, the combination of two-mask patterns results in only a slightly higher score, and the one-mask pattern M (y T) even outperforms MM (y T).

A.3 Patterns without brackets and dashes
The patterns discussed above contain extra dashes that were added by mistake and could affect the results, so we first remove them. We also hypothesize that brackets are uncommon in Spanish text, so such patterns could degrade the generated substitutes and the final results. To test this, we compare y-based patterns with and without brackets and dashes.
Figures 4 and 5 show that in all cases removing brackets and dashes considerably improves our results; in particular, the right pattern gains around 0.1 in both JSD,SPR and COMPARE,SPR scores.