BOS at SemEval-2020 Task 1: Word Sense Induction via Lexical Substitution for Lexical Semantic Change Detection

SemEval-2020 Task 1 is devoted to the detection of changes in word meaning over time. The first subtask asks whether a particular word has acquired or lost any of its senses during the given time period. The second subtask requires estimating the change in frequencies of the word senses. We submitted two solutions for both subtasks. The first performs word sense induction (WSI) and then makes the decision based on the induced word senses. We extend an existing WSI method based on clustering of lexical substitutes generated with neural language models and adapt it to the task. The second solution exploits a well-known approach to semantic change detection: building word2vec SGNS vectors, aligning them with Orthogonal Procrustes, and calculating the cosine distance between the resulting vectors. While the WSI-based solution performs better in Subtask 1, which requires binary decisions, the second solution outperforms it in Subtask 2 and obtains the 3rd best result in this subtask.


Introduction
Lexical semantic change detection (LSCD) is the problem of detecting changes in word meaning over time. SemEval-2020 Task 1  suggests two variations of this problem. The first one (Subtask 1) requires detecting words that have changed their set of senses, either acquiring a new sense or losing an old one, between two given time periods. For this subtask we extend and adapt the state-of-the-art WSI method based on lexical substitution with neural language models proposed in (Amrami and Goldberg, 2019). Instead of the English BERT employed by the original method, we use the recently introduced masked language model XLM-R (Conneau et al., 2019), which was trained on all languages considered in the task. We adapt this model to the task datasets by additionally finetuning it to accept lemmatized texts as input and produce lemmatized substitutes. We also experiment with different dynamic patterns and their combinations, as well as with multi-token substitute generation. For Subtask 2 we obtained better results with another approach, based on SGNS word embeddings and Orthogonal Procrustes alignment.

Related work
In this section we describe the works our solutions are directly based on. Please refer to  for a more detailed overview of the task and alternative approaches.
Subtask 1 asks whether the set of senses of a particular word has changed. This task would be simple to solve if we could first identify word senses and the corresponding word occurrences. WSI is the task of clustering occurrences of an ambiguous word according to its senses. One of our solutions employs a substitution-based approach (Baskaya et al., 2013; Amrami and Goldberg, 2018; Arefyev et al., 2019), which exploits lexical substitutes to distinguish word senses. In the recently proposed state-of-the-art implementation of this approach (Amrami and Goldberg, 2019), lexical substitutes (i.e. words that can replace the ambiguous word in a given context) are generated for each occurrence of the ambiguous word using the BERT masked language model (MLM) (Devlin et al., 2018). Then substitute vectors are built, which are TF-IDF weighted bag-of-words vectors based not on the words in context, but on the generated substitutes. Finally, these vectors are clustered using agglomerative clustering, and the resulting clusters are taken as the induced senses.
XLM-R (Conneau et al., 2019) is an MLM trained similarly to the multilingual version of BERT on texts in 100 languages (including the four languages considered in the shared task), but it has 3x more parameters and a 2x larger vocabulary, and was trained not only on Wikipedia but also on the Common Crawl dataset, which increased the training data for low-resource languages by orders of magnitude, resulting in more than 2TB of data in total. XLM-R has shown better performance than multilingual BERT in various NLP tasks. Based on these advantages, we decided to employ XLM-R for substitute generation in our solution.
Subtask 2 asks how much the frequency distribution of senses has changed for a particular word. This task is well explored and can be solved without explicit WSI by exploiting implicit features such as the change in frequencies of word contexts. Surveys like (Kutuzov et al., 2018), (Tahmasebi et al., 2018) and  compare different approaches to this task. Most of these approaches, including (Dubossarsky et al., 2019) and (Hamilton et al., 2016), involve building Skip-gram Negative Sampling (SGNS) word representations and computing cosine distances between them. We employed one such method as our baseline solution.

SGNS+OP+CD
As a baseline solution we used SGNS vectors with Orthogonal Procrustes alignment and cosine distance (SGNS+OP+CD), since this combination is known to have strong performance on a task similar to Subtask 2 (Schlechtweg et al., 2019). We did not tune hyperparameters; instead, we used one of the configurations listed in the paper: window size=5, k=5, t=0.001, dimensions=300, minCount=0, iters=5. To solve Subtask 1, we predicted the positive class if the cosine distance was above a certain threshold, which was set to 0.5 because no information about class proportions was available. After the competition we analysed the effect of the threshold value on the results, see figure 1.
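The alignment and scoring steps above can be sketched as follows. This is a minimal NumPy illustration under the assumption that the two SGNS embedding matrices share a common vocabulary; the function names are ours, and training the SGNS vectors themselves (e.g. with gensim) is out of scope here.

```python
import numpy as np

def procrustes_align(X, Y):
    """Rotate the rows of X into the space of Y.

    Finds the orthogonal matrix W minimizing ||XW - Y||_F
    (the classic Orthogonal Procrustes solution via SVD).
    X, Y: (vocab_size, dim) embedding matrices over a shared vocabulary.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return X @ (U @ Vt)

def cosine_distance(u, v):
    """1 - cosine similarity: the change score for one target word."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The change score for a target word is then the cosine distance between its aligned old-corpus vector and its new-corpus vector; for Subtask 1 this score is compared against the 0.5 threshold.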

BOS+AggloSil+DC
Our WSI-based solution employs a WSI method being developed in our parallel work on WSI. It involves the following steps. For a particular target word, its occurrences in the old and new corpora are collected, and lexical substitutes are generated for each of them. Then WSI is performed by clustering bag-of-substitutes (BOS) vectors with agglomerative clustering, employing the silhouette score to select the number of clusters (AggloSil). Finally, we search for a decision cluster (DC), i.e. a cluster that has a large number of occurrences from one corpus and a small number from the other, and predict semantic change if such a cluster exists. Next we provide a detailed description of each step.
Lexical substitutes generation. We exploit the multilingual masked language model XLM-R (Conneau et al., 2019) to generate lexical substitutes for a given occurrence of some target word. The simplest option is replacing the target word in the given text fragment with the special token "<mask>" and passing this modified text to XLM-R, which is trained to generate words that can appear in masked positions. However, this usually results in substitutes that are not related to the target. Following (Amrami and Goldberg, 2019), we experimented with dynamic patterns. For instance, we may replace the target with "T and <mask>" and then replace T back with the target (this pattern is denoted "T and M" for brevity). Thus, given a sentence I love old planes with the ambiguous word planes, instead of I love old <mask>, the model receives I love old planes and <mask>. In the latter case, the model returns substitutes like vehicles, machines, stuff, etc., which are more closely related to the target than the substitutes returned in the former case, like times, school, movies, etc. Since XLM-R was trained on raw texts while the task corpora are pre-processed and lemmatized, we finetuned it using the MLM objective on a small subset of 72K randomly sampled sentences, with an equal number of sentences from the old and new corpora for each language. This accustoms XLM-R to pre-processed and lemmatized inputs, and also makes it return lemmas instead of word forms as substitutes. Finetuning on the whole task corpora may improve results, but requires much more computational resources; hence, we leave it for future work.
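The pattern mechanics can be illustrated with a small string-level sketch. This is our own simplification: the actual substitutes are the topk predictions of XLM-R at the mask position, which we omit to keep the example self-contained, and the helper name is hypothetical.

```python
def apply_pattern(sentence, target, pattern, mask_token="<mask>"):
    """Build the masked-LM input for a dynamic pattern.

    In the pattern template, "T" stands for the target word and "M" for
    the mask token, e.g. "T and M" or "M or T". The plain "M" pattern
    simply masks the target out.
    """
    # Fill the template, then splice it in place of the target word.
    filled = pattern.replace("T", target).replace("M", mask_token)
    return sentence.replace(target, filled, 1)

# Example from the text:
# apply_pattern("I love old planes", "planes", "T and M")
#   -> "I love old planes and <mask>"
```

A substitute generator would then pass this string through the MLM and read off the most probable tokens at the mask position.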
In the post-evaluation phase we experimented with multi-subword substitutes such as "T and MM". This means inserting several masks into the input, predicting the topk most probable fillers for the first one, and the single most probable continuation for each of them (we also tried beam search, which gave no improvements). Additionally, we tried combining symmetric patterns, meaning that the probabilities of substitutes for the patterns "T and MM" and "MM and T" are multiplied before selecting the most probable substitutes.
Bag-of-substitutes (BOS) vectors. For each occurrence of a target word we take only the topk substitutes with the highest probabilities. We suppose that the target word has multiple senses and thus cannot be a useful substitute itself, so we remove it from the substitutes. After that we filter out substitutes that were generated for less than a min_df or more than a max_df fraction of occurrences of the same target, since too rare or too frequent substitutes are likely to be useless for discriminating between the senses of this target. Finally, bag-of-substitutes vectors are built, which are essentially bag-of-words vectors over the substitutes rather than over the original text fragments.
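A minimal sketch of this vectorization step using scikit-learn; min_df and max_df mirror the thresholds described above (as fractions of occurrences), while the function name and the TF-IDF weighting choice are our illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_bos_vectors(substitutes, target, min_df=0.01, max_df=0.98):
    """Turn per-occurrence substitute lists into bag-of-substitutes vectors.

    substitutes: list of lists, one list of generated substitutes per
    occurrence of the target word (already truncated to the topk ones).
    The target itself is removed; substitutes generated for too few or
    too many occurrences are filtered out via min_df / max_df.
    """
    docs = [[w for w in subs if w != target] for subs in substitutes]
    # An identity analyzer lets the vectorizer consume token lists directly.
    vectorizer = TfidfVectorizer(analyzer=lambda doc: doc,
                                 min_df=min_df, max_df=max_df)
    return vectorizer.fit_transform(docs), vectorizer
```

The returned sparse matrix has one row per occurrence and one column per surviving substitute, and is what the clustering step consumes.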
Additionally, substitutes can be lemmatized. This was found useful for the default (non-finetuned) XLM-R language model, because it tends to generate substitutes in different grammatical forms depending on the context, thus increasing sparsity (Amrami and Goldberg, 2018). However, our MLM is finetuned to predict lemmas, so lemmatization of substitutes is less important in our case.
Clustering. For each target word separately, we cluster all BOS vectors of its occurrences from the old and the new corpora together. Following (Amrami and Goldberg, 2019), we use the agglomerative clustering algorithm with cosine distance and average linkage. The number of clusters that maximizes the silhouette score of the clustering is selected.
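This step can be sketched as follows, using SciPy's average-linkage hierarchical clustering over cosine distances and scikit-learn's silhouette score; the candidate range of cluster numbers and the function name are our assumptions for the example, and dense input vectors are assumed (sparse BOS matrices would need densifying first).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def cluster_occurrences(vectors, max_clusters=13):
    """Cluster BOS vectors, picking the number of clusters in
    2..max_clusters that maximizes the cosine silhouette score."""
    Z = linkage(pdist(vectors, metric="cosine"), method="average")
    best_labels, best_score = None, -1.0
    for n in range(2, max_clusters + 1):
        labels = fcluster(Z, t=n, criterion="maxclust")
        if len(set(labels)) < 2:  # degenerate split, silhouette undefined
            continue
        score = silhouette_score(vectors, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

Note that the dendrogram is built once and cut at each candidate number of clusters, which keeps the model selection cheap.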
Final prediction. The decision function returning the final binary label for Subtask 1 is based on the labeling criterion provided by the task organizers. Namely, if there is a cluster containing less than k examples from the old or the new corpus and more than n examples from the other one (we call it the decision cluster, or DC), then we predict that a change of word senses took place.
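The decision rule can be sketched as follows (our own code; the direction of the k/n comparisons follows the wording above, and corpus labels are assumed to be 0 for the old and 1 for the new corpus):

```python
def predict_change(cluster_labels, corpus_labels, k, n):
    """Return True if some cluster has fewer than k examples from one
    corpus and more than n examples from the other (a decision cluster)."""
    for c in set(cluster_labels):
        old = sum(1 for cl, co in zip(cluster_labels, corpus_labels)
                  if cl == c and co == 0)
        new = sum(1 for cl, co in zip(cluster_labels, corpus_labels)
                  if cl == c and co == 1)
        if (old < k and new > n) or (new < k and old > n):
            return True
    return False
```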
To solve Subtask 2, for each target we build two vectors with dimensionality equal to the number of clusters, containing the number of examples from the old or the new corpus in each cluster. The final score is the cosine distance between these vectors.
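The Subtask 2 score can then be computed as in the following sketch (our own code, with the same 0/1 corpus-label convention as before):

```python
import numpy as np

def sense_distribution_distance(cluster_labels, corpus_labels):
    """Cosine distance between per-cluster occurrence counts of the
    old corpus (label 0) and the new corpus (label 1)."""
    clusters = sorted(set(cluster_labels))
    old = np.array([sum(1 for cl, co in zip(cluster_labels, corpus_labels)
                        if cl == c and co == 0) for c in clusters], dtype=float)
    new = np.array([sum(1 for cl, co in zip(cluster_labels, corpus_labels)
                        if cl == c and co == 1) for c in clusters], dtype=float)
    return 1.0 - (old @ new) / (np.linalg.norm(old) * np.linalg.norm(new))
```

Identical sense distributions in the two corpora give a score of 0, while completely disjoint ones give 1.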

Datasets
All our results reported below were obtained on the test sets presented in the competition SemEval-2020 Task 1 . They are based on English (Alatrash et al., 2020), German (Deutsches Textarchiv, 2017;Berliner Zeitung, 2018), Swedish (Språkbanken, Downloaded in 2019) and Latin (McGillivray and Kilgarriff, 2013) corpora. Since no train or development sets were provided, we employed WSI and LSCD datasets for the Russian language, namely, bts-rnc (Panchenko et al., 2018) and macro (Fomin et al., 2019), to select good hyperparameters while not overfitting to the test sets.

Subtask 1: Binary Classification
Our results during the evaluation and post-evaluation periods for Subtask 1 are reported in tables 1 and 2. Since no estimates of class proportions were available during the evaluation period, we used a threshold of 0.5 in our SGNS+OP+CD solution for Subtask 1, meaning that we predict that a semantic change happened if the cosine distance is larger than 0.5. Figure 1 shows how dramatically the accuracy of the SGNS+OP+CD approach depends on the choice of this threshold. At the border points of the plot (0.0 and 1.0) the accuracy equals the proportion of examples that belong to the positive or the negative class, respectively. As we can see, the SGNS+OP+CD accuracy with the selected threshold is only a little higher than the most frequent class (MFC) classifier accuracy on English and German, and is significantly lower on Latin and Swedish. The optimal threshold depends on the language, and even its performance is not much better than that of the MFC. Moreover, according to table 1, the MFC accuracy is comparable to the best participants' results on Latin and Swedish. Hence, for our methods we report the macro-averaged F1 score along with accuracy, which is the official metric. Additionally, it is worth noting that due to the small number of test examples (30-50 words per language), one should be cautious when drawing any conclusions from the observed results. In particular, we tried to check the statistical significance of the difference between our results and the best result for each language using McNemar's test and the Wilcoxon signed-rank test, but both tests failed to reject the null hypothesis at the significance level of 0.05 for all languages except German.

During the evaluation period we did not select the optimal hyperparameters of the WSI method due to the lack of time; instead, all submissions using the BOS approach share most of the hyperparameters, which were set intuitively. We used the dynamic pattern "M or T", topk=500, and a TF-IDF vectorizer with max_df=0.98.
The number of clusters was picked between 2 and 13 by the silhouette score. The thresholds n, k were set to the values 0,1 for Latin and 2,5 for the other languages, which are specified in the task description. The only hyperparameter we varied was min_df, the minimal proportion of examples a particular substitute must be generated for to survive filtering. According to our previous experience, selecting this hyperparameter is crucial, because it allows filtering out noisy substitutes while preserving useful ones. Table 1 compares our submissions during the evaluation phase to the winners of the competition. For the BOS+AggloSil+DC method, the best (v.3) of three submissions is shown, with min_df set to 0.02 for Swedish and 0.01 for the other languages.
In the post-evaluation experiments we switched to the hyperparameters selected on the Russian WSI dataset: the dynamic pattern "MM or T" combined with the symmetric one "T or MM", topk=150, and a count vectorizer with min_df=0.03 and max_df=0.8. We noticed that small n, k result in many false positives. Hence, we decided to use the values optimal for the Russian LSCD dataset, which are 10,15. The selected hyperparameters improved the results for all languages except Latin (where the positive class dominates; hence, more conservative thresholds are harmful).
From table 2 we can see that without dynamic patterns, substitutes consisting of two subwords ("MM") often outperform single-subword substitutes ("M"). We cannot reliably tell whether the usage of dynamic patterns is effective or not. The target words are mostly nouns, and Amrami and Goldberg (2018) show that symmetric patterns have a relatively small impact on WSI quality for nouns. The finetuned model seems to perform better for all languages except Swedish (for Swedish the accuracy is the same, but the F1 score is lower). The "MM and T" pattern seems to perform worse than the default "MM or T" (except for German, where the patterns perform similarly). When dynamic patterns are used, combining symmetric patterns or generating two-subword substitutes did not show consistent improvements. Moreover, the "M or T" pattern without pattern combination outperformed the default method on most languages (except German). This contradicts our results in WSI for the Russian language, where these techniques significantly improve WSI performance. One possible explanation is the lemmatized datasets, which may negatively affect XLM-R performance even after finetuning. To check this, we lemmatized the Russian WSI dataset and found a large drop in WSI performance when non-finetuned XLM-R is used. Finetuned XLM-R performed much better, but still worse than the non-finetuned version on non-lemmatized raw texts. The optimal values of the k and n parameters appear to vary vastly across languages and datasets. For instance, after changing them from the values selected on the Russian LSCD dataset to the values specified by the organizers, the results improved for Latin, but worsened for all other languages.

Table 2: Subtask 1 post-evaluation results. *We change (n,k) from (10,15) to (0,1) for Latin and (2,5) for other languages.

Table 3: Subtask 2 results, Spearman rank correlation.

Subtask 2: Ranking
On the second subtask the SGNS+OP+CD approach showed quite good performance, especially on German and Swedish. Overall, we obtained the 3rd best result. The results of the BOS+AggloSil+DC approach, however, are not as good. We assume this is due to an inappropriate number of clusters often selected by the silhouette score (in several cases there were only two clusters). The results are shown in table 3.

Decision Process Analysis
One of the advantages of the BOS+AggloSil+DC method is the possibility of explaining its decision for a particular target word by analyzing the clusters of occurrences of this word and the corresponding substitutes. We have developed a web application that visualizes the process of solving the WSI or LSCD task by our system. This application can be useful for error analysis and can potentially help linguists in their research on lexical semantics. In table 4 we provide examples of information from the application explaining system decisions. For each cluster we display the most probable substitutes for old / new examples. The probability is estimated as the proportion of examples from the old / new corpus for which a particular substitute was generated. Table 4 shows the 7 most probable substitutes for the largest cluster (LC), often corresponding to the most frequent sense, and for the decision cluster (DC).

Table 4: System decision explanation examples. Substitutes were obtained using the combination of the "T or MM" and "MM or T" patterns; target words were not removed.
The first example (player) appears to gain the new sense "media player", since there is a cluster with almost all examples from the new corpus, in which frequent substitutes for the target word are receiver, recorder, etc. The second example (ounce) is an incorrect prediction of our system. The decision cluster includes only examples with the word ounce preceded by a number, such as three-ounce, 10-ounce, etc. Thus the language model mostly generates numbers as substitutes. This example shows that specific patterns of word usage can destabilize the performance of our system. For both words the decision cluster included only an insignificant number of examples from the old corpus. Thus the top substitutes for old-corpus examples in the decision cluster are irrelevant.

Conclusion
We proposed a WSI-based solution for the lexical semantic change detection problem. It outperforms the well-known SGNS+OP+CD baseline on Subtask 1. However, the baseline performs better on Subtask 2, where we obtained the 3rd best result. We have shown that finetuning XLM-R on the lemmatized task corpora improves the final results for Subtask 1. A comparison of different variants of our method is provided. However, a larger dataset is required to draw reliable conclusions from these observations.

A Subtask 1 results depending on hyperparameters
In this section we show the dependence of the macro-F1 score for Subtask 1 on the values of the hyperparameters. Figure 2 shows the effect of topk and min_df on the performance. The remaining hyperparameters are the same as in the first line of table 1. Notice that the variance of the results is very high: even a small change in hyperparameter values may change the F1 score by more than 10 points, which is due to the small test set size. Because of this variance, even selecting hyperparameters on a development set for the same language, if such a dataset existed, would likely be useless. Figure 3 shows the dependence of performance on the thresholds of the final decision step. We see a clear dependence on k for English and German. This is due to imperfect WSI predictions, which often put into each cluster a few occurrences with unrelated word senses. Larger values of k allow the model to ignore this noise.