SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is simple and generic, yet performs relatively well in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.


Introduction
The meaning of a word can vary not only with context, but also with time. The former phenomenon is commonly known as context-sensitivity, or, if the variation is of a more categorical nature, polysemy, whereas the latter phenomenon is captured by the term diachronic semantic drift (Kutuzov et al., 2018), or alternatively lexical semantic change. As an example, a term such as "beautiful" has one main meaning, but will nonetheless imply slightly different things depending on its context of use; "suit" on the other hand has several distinct meanings (e.g. as a verb or as a noun), while "mouse" has acquired a completely new meaning with the introduction of computer hardware. Of course, the distinction between context-sensitivity and polysemy is anything but clear-cut; this is a slippery theoretical slope, on which it is best to tread lightly. Even so, enabling the detection of such lexical semantic change over time could accelerate research in historical linguistics (Szymanski, 2017), and also initiate the development of decision-making systems that exploit diachronically shifting information (Rosin et al., 2017).
The backbone of a lexical semantic change detection system is word embeddings, which represent the meaning (or at least the use) of words. Different systems rely on various types of language models, nowadays predominantly distributional in nature. There is a comparably rich literature on distributional approaches to modeling diachronic semantic drift; examples include Sagi et al. (2008), Hamilton et al. (2016), and Yao et al. (2018). More complete overviews of existing diachronic semantic shift detection techniques are provided in Tahmasebi et al. (2018) and Kutuzov et al. (2018).
Contextualized language models constitute a recent breakthrough in the field of NLP (Devlin et al., 2018; Radford et al., 2018), by virtue of their ability to provide embeddings that are sensitive to a specific context of use. This differs from standard word embeddings, which aggregate all of a word's contexts into one global representation. Another way of characterizing this difference is to say that contextualized language models provide token-based representations, while standard word embeddings are type-based. Motivated by the success of contextualized language models in handling polysemy and context-sensitivity, we investigate how such embeddings can be used to model diachronic semantic change; we use a contextualized language model as the basis of our proposed system for both sub-tasks of the Unsupervised Lexical Semantic Change Detection task, featured in SemEval 2020. More specifically, we produce contextualized embeddings for each occurrence of a term, and cluster these embeddings to arrive at a form of sense clusters. By leveraging multilingual contextualized representations, our approach is agnostic to the language of the input corpus, and it does not rely on any specific information about when the corpus was written. Despite the simplicity of our approach, our system ranked 10th in Subtask 1 and 8th in Subtask 2.

Subtask description

Subtask 1
Subtask 1 is Binary Lexical Semantic Change, as defined in the task description. Given a set of terms and two corpora from two different time periods, the goal is to identify the terms whose sets of senses differ between the two corpora, and consequently between the two time periods. The fact that the corpora are also provided in languages other than English encourages the design of language-agnostic systems (Kutuzov et al., 2018). No labels are provided, so the system should be unsupervised. Annotations provided by the task organizers are used as ground truth labels in the evaluation phase of the proposed systems.

Subtask 2
Subtask 2 is a modified version of Subtask 1, where a ranking of the given set of terms should be produced, based on the degree to which the distribution of senses has shifted between the two corpora, or time periods. This is referred to as Graded Lexical Semantic Change in the literature. The difference between the two normalized sense distributions, measured by the Jensen Shannon Divergence (Lin, 1991), is used as the ranking criterion. Both the corpora and the annotations are the same as those used in Subtask 1.
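For reference, the standard definition of the Jensen Shannon Divergence between two normalized sense distributions can be written as below. The symbols P, Q and M are introduced here for illustration only, and the logarithm base, which the text does not specify, merely rescales the value.

```latex
\[
\mathrm{JSD}(P \,\|\, Q)
  = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M)
  + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad
M = \tfrac{1}{2}\,(P + Q),
\]
\[
D_{\mathrm{KL}}(P \,\|\, M) = \sum_{i} P(i) \,\log \frac{P(i)}{M(i)},
\]
```

where terms with P(i) = 0 contribute zero to the sum.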

Data description
The data for the two subtasks is the same and covers four languages, with two corpora per language. The four languages are English, German, Latin and Swedish. The two corpora for each language come from two different time periods. The data has been lemmatized and converted to lowercase. For each of the corpora there are some particularities noted in the SemEval-2020 Task 1 data description. The most important particularity is the frequent OCR errors found in several of the corpora, which lower the quality of the data.

Solution Outline & Main Idea
Given a word W, we generate the contextualized embeddings for all occurrences of W in the two corpora (C 1 and C 2 ) while keeping a reference to the source corpus. The contextualized embeddings from both corpora are then clustered. Each occurrence of a word is thus represented by its contextualized embedding, source label and cluster label. We then solve the tasks using cluster labels as a direct proxy for senses. We refer to this method as SenseCluster.
The main idea is that contextualization, in part, serves to disambiguate between senses: we hypothesize that the contextualized embeddings of cell in the phone-sense are, in general, closer to each other in embedding space than to the contextualized embeddings of cell in the chamber-sense, and vice versa. In other words, we hypothesize that the senses of a word W manifest themselves as clusters in the contextualized embeddings of W. The origins of this idea can be traced back to the work of Schütze (1998). Recent work in Lexical Semantic Change using BERT gives credence to our hypothesis: Giulianelli et al. (2020) and Martinc et al. (2020) perform k-means clustering on BERT representations of target words in order to detect temporal semantic change in a large diachronic English corpus (Davies, 2012). Additionally, Wiedemann et al. (2019) show that a simple k-Nearest Neighbor classifier (Cover and Hart, 1967) on contextualized representations can be used for word sense disambiguation.

Contextualized Embeddings: XLM-R
Word embeddings can handle synonymy, but not polysemy (at least not in any obvious way, although there are some attempts at uncovering polysemy in word embeddings, such as Relative Neighborhood Graphs (Cuba Gyllensten and Sahlgren, 2015)). Contextualized language models, on the other hand, can: for each occurrence of a term, a contextualized language model produces a contextualized embedding that takes the surrounding context into account. Prominent examples of contextualized language models are BERT (Devlin et al., 2018) and GPT (Radford et al., 2019). BERT is a Transformer-based model (Vaswani et al., 2017) that produces deep bidirectional representations as a result of its masked language modeling training objective. GPT, on the other hand, is a unidirectional Transformer-based model that is pre-trained using the standard language modeling objective (Radford et al., 2019). Pre-trained contextualized language models are often transferred to task-specific architectures (Devlin et al., 2018; Radford et al., 2019), building on previous work (Howard and Ruder, 2018).
Contextualized language models are extensively used in the domain of cross-lingual language understanding (Lample and Conneau, 2019). Apart from being beneficial for cross-lingual understanding tasks, contextualized cross-lingual embeddings enable model transfer between languages (Ruder et al., 2019). The latter can be beneficial for low-resource languages.
We use XLM-R for producing term representations. XLM-R is a Transformer-based masked language model, trained on 2.5 TB of filtered CommonCrawl data in 100 languages. Compared to previous multilingual masked language models, such as multilingual BERT (mBERT) (Devlin et al., 2018) and XLM (Lample and Conneau, 2019), the size of the pre-training dataset of XLM-R is increased by several orders of magnitude, especially for low-resource languages. XLM-R outperforms mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) in cross-lingual classification, as well as in monolingual tasks. We hypothesize that our multilingual setting can benefit from the use of XLM-R, by virtue of cross-lingual transfer. This can be especially advantageous for less resourced languages such as Latin, and perhaps also Swedish.

Clustering: K-Means
Our approach to the problem of semantic shift detection is cluster-based. We use K-Means++ (Arthur and Vassilvitskii, 2006) to induce a set of sense clusters in the contextualized embedding space. K-Means++ is a modified version of the widely used K-Means clustering algorithm, which splits the input data points into a predefined number of clusters by minimizing the average squared distance between each point and its cluster centroid (Lloyd, 1982). To alleviate the dependency of K-Means performance on proper initialization of the cluster centroids, K-Means++ introduces a randomized seeding technique (Arthur and Vassilvitskii, 2006). Previous work (Wiedemann et al., 2019) shows that distance-based methods, such as the k-Nearest Neighbor classifier (Cover and Hart, 1967), can be used to group contextualized embeddings that encode the same sense information. Given the unsupervised nature of our task, K-Means++ is a reasonable choice for our system.
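As an illustration of the randomized seeding step, the following is a minimal pure-Python sketch of K-Means++ initialization: the first centroid is drawn uniformly, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function names and the toy data are assumptions made for this example; this is not the implementation used in our experiments.

```python
import random


def squared_distance(a, b):
    """Squared euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids with K-Means++ (squared-distance) seeding."""
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a centroid; fall back to uniform.
            centroids.append(rng.choice(points))
            continue
        # Sample the next centroid with probability proportional to d2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


# Toy data: two tight groups of 2-dimensional points.
toy_points = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0), (10.0, 10.0)]
seeds = kmeans_pp_seeds(toy_points, k=2)
```

Because points far from the existing centroids are more likely to be picked, the seeds tend to be spread across the groups before the usual K-Means iterations start.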

Method
We generate contextualized embeddings of target words using XLM-R. Given, for example, the target word edge and the sentence "they sit down together upon the edge of the bed", the whole sentence is passed as input to XLM-R. We then extract the embedding from the output layer corresponding to the word edge, i.e. its contextualized embedding. When the target word consists of several wordpieces, and thus several embeddings, we take the average of these. We then cluster all contextualized embeddings for a target term using K-Means++ with the distance metric set to euclidean. For simplicity, we set the number of clusters to 8 for all target terms and languages.
To measure diachronic shift between the two corpora, we aggregate this clustering into a table, as shown in Table 3, by counting the number of occurrences per cluster label and source corpus.
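The wordpiece averaging step can be sketched as follows. The example assumes that the output-layer vectors of a sentence are already available as plain lists and that the positions of the target word's wordpieces are known; the function names and the toy 2-dimensional vectors are hypothetical and stand in for real XLM-R outputs.

```python
def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def target_embedding(piece_vectors, start, end):
    """Average the output-layer vectors of the wordpieces in [start, end)."""
    if not 0 <= start < end <= len(piece_vectors):
        raise ValueError("target span lies outside the sentence")
    return mean_pool(piece_vectors[start:end])


# Toy 2-dimensional "embeddings" for a sentence split into four wordpieces;
# pieces 1 and 2 are assumed to make up the target word.
sentence_vectors = [[0.0, 1.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
embedding = target_embedding(sentence_vectors, 1, 3)
```

A target word that maps to a single wordpiece is simply the special case where the span has length one.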

Word                 Corpus 1       Corpus 2
                     #     %        #     %
Cluster / Sense 1    12    40%      1     3%
Cluster / Sense 2    18    60%      11    37%
Cluster / Sense 3    0     0%       18    60%

Table 3: Example of a cluster assignment for the contextualized embeddings of a word. For Subtask 1 we say that there has been a sense change if there exists a cluster such that it contains < 2 occurrences from corpus 1 and > 5 occurrences from corpus 2, or vice versa. In this example, going from Corpus 1 to Corpus 2, the word lost Sense 1, but gained Sense 3. For Subtask 2 we measure the Jensen Shannon Divergence between the sense distributions of the corpora. In this example, Corpus 1 has sense distribution (0.4, 0.6, 0), whereas Corpus 2 has sense distribution (0.03, 0.37, 0.6), which results in a Jensen Shannon Divergence of ≈ 0.73.
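The aggregation into such a table can be sketched in a few lines. The function name, the 0-based cluster labels and the (corpus, cluster) pair representation are assumptions made for this illustration, not a description of our released code.

```python
from collections import Counter


def sense_table(occurrences, n_clusters):
    """Count occurrences per (corpus, cluster) and normalise per corpus.

    `occurrences` is an iterable of (corpus_id, cluster_label) pairs,
    with corpus_id in {1, 2} and cluster_label in range(n_clusters).
    """
    counts = Counter(occurrences)
    table = {}
    for corpus in (1, 2):
        column = [counts[(corpus, c)] for c in range(n_clusters)]
        total = sum(column)
        shares = [x / total if total else 0.0 for x in column]
        table[corpus] = {"counts": column, "shares": shares}
    return table


# Rebuild the example table: 30 occurrences per corpus, three clusters.
occurrences = (
    [(1, 0)] * 12 + [(1, 1)] * 18
    + [(2, 0)] * 1 + [(2, 1)] * 11 + [(2, 2)] * 18
)
table = sense_table(occurrences, n_clusters=3)
```

The per-corpus `counts` feed the decision rule of Subtask 1, while the normalised `shares` are the sense distributions compared in Subtask 2.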

Subtask 1
Using the cluster labels as a proxy for senses, we solve the first task using the method described in the task definition. If there exists a cluster such that it contains < k occurrences from corpus 1 and > n occurrences from corpus 2, or vice versa, we say that there has been a sense change. We always let k = 2, n = 5, and set the number of clusters to 8, regardless of language and the total number of occurrences. For example, given the cluster assignments in Table 3, we would say that there have been two sense changes: going from Corpus 1 to Corpus 2, the word lost Sense 1, but gained Sense 3.
We consider this our baseline approach. While the number of clusters, k, and n are hyperparameters that can be tuned, we believe that a different choice of clustering algorithm would yield larger improvements in performance.
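The decision rule can be expressed compactly. The sketch below applies it to the counts from the Table 3 example; the function name and the "gained"/"lost" labels are chosen for this illustration only.

```python
def sense_changes(counts_1, counts_2, k=2, n=5):
    """List clusters with < k occurrences in one corpus and > n in the other.

    Returns (cluster_index, direction) pairs, where direction describes the
    change when going from corpus 1 to corpus 2.
    """
    changes = []
    for cluster, (c1, c2) in enumerate(zip(counts_1, counts_2)):
        if c1 < k and c2 > n:
            changes.append((cluster, "gained"))
        elif c2 < k and c1 > n:
            changes.append((cluster, "lost"))
    return changes


# Per-cluster counts from the Table 3 example (0-based cluster indices).
changes = sense_changes([12, 18, 0], [1, 11, 18])
# Binary label: 1 if at least one cluster signals a sense change.
label = int(bool(changes))
```

With k = 2 and n = 5, the first cluster is reported as lost and the third as gained, so the word is labeled as changed.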

Subtask 2
Subtask 2 is also solved by a direct translation of the task definition, i.e. we solve it by computing the Jensen Shannon Divergence between the cluster distributions of the two corpora. We use the same cluster assignments in Subtask 2 as in Subtask 1.
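A minimal sketch of this computation, using only the standard library, is given below. The function names are illustrative, and the base-2 logarithm is an assumption. The resulting number depends on the logarithm base and on whether the divergence or its square root (the Jensen Shannon distance) is reported, neither of which is stated in the text, so the sketch should not be expected to reproduce the value quoted with Table 3.

```python
import math


def kl_divergence(p, m, base=2.0):
    """Kullback-Leibler divergence D(p || m); zero-probability terms are skipped."""
    return sum(pi * math.log(pi / mi, base) for pi, mi in zip(p, m) if pi > 0)


def jensen_shannon_divergence(p, q, base=2.0):
    """Jensen Shannon Divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


def normalise(counts):
    """Turn per-cluster counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


# Per-cluster counts from the Table 3 example.
p = normalise([12, 18, 0])
q = normalise([1, 11, 18])
divergence = jensen_shannon_divergence(p, q)
distance = math.sqrt(divergence)  # the related Jensen Shannon distance
```

Ranking the target terms by `divergence` (or, equivalently, by `distance`, since the square root is monotonic) yields the graded ordering required by the subtask.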

Results
The code for the experiments is publicly available. The best performing baseline is the CNT+CI+CD model, a co-occurrence counting method. For Subtask 1 our method outperforms the best performing baseline on average, but performs worse for English and German. For Subtask 2 our method outperforms the best performing baseline by a wide margin for all languages except Latin, where the performance is only slightly better.

Discussion & Conclusion
We chose XLM-R because it is a single, pre-trained, performant contextualized model trained on all the languages in the task. As such, the method easily extends to all other languages XLM-R has been trained on. However, the choice of XLM-R has certain drawbacks. It is trained on CommonCrawl data, which we assume is heavily skewed towards contemporary language. This might have a negative impact on model performance, since the task is dominated by historical data. More importantly, we believe that the biggest drawback of using XLM-R (or similar language models) is that it is trained with minimal preprocessing. The task data, on the other hand, was lemmatized, PoS-tagged (for English), and contained frequent OCR errors. We believe this mismatch in preprocessing and data quality has had a very detrimental effect on the quality of the contextualized embeddings we extract from XLM-R.
Our approach, while not yielding great results, outperformed the baselines and scored relatively well on Subtask 2. We argue that it performed surprisingly well given its simplicity. Our method can be condensed into two parts: (i) the hypothesis that clusters of tokens in contextualized embedding space approximate senses (a conjecture that is corroborated by other recent work (Wiedemann et al., 2019)), and (ii) the implicit (and completely preposterous) assumption that every term has eight senses.
If we assume that (i) is true, then the apparent falsehood of (ii) poses two problems. If a term has more than eight senses, our method will conflate senses by putting them in the same cluster: a supersense, if you will. If it has fewer than eight senses, our method will split senses into subsenses. Both scenarios are problematic for our model: a subsense change might occur without a sense change, and conversely, a sense change might occur within a supersense without a supersense change.
We hypothesize that this effect is greater in Subtask 1 than in Subtask 2, due to the more discrete nature of Subtask 1. One possible remedy for this is to use a data-driven method to determine the number of clusters, for example X-means