SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces

Lexical semantic change detection (also known as semantic shift tracing) is the task of identifying words that have changed their meaning over time. Its unsupervised variant, the focal point of SemEval-2020, is particularly challenging. Given the unsupervised setup, we propose to identify clusters among the different occurrences of each target word and to treat these clusters as representatives of different word meanings. Disagreements between the clusters obtained for the two time periods then naturally quantify the level of semantic shift of each target word in four target languages. To leverage this idea, clustering is performed on contextualized (BERT-based) embeddings of word occurrences. The obtained results show that our approach performs well both per language and overall, where it surpasses all provided SemEval baselines.


Problem Setup
Consider two corpora C_1 and C_2 for the same language but associated with different time stamps (say, respectively, t_1 and t_2 > t_1). Let W be a set of target words occurring in both corpora. Each target word w ∈ W might assume multiple meanings, called senses, within the two corpora. A pool of experts annotated a representative number of occurrences with their corresponding senses. The problem we consider is to characterize the semantic shift related to those senses from one corpus to the other without having access to the expert annotations. In particular, we address the following two subtasks: • Subtask 1: Decide, for each w ∈ W, whether or not w gained or lost at least one sense between t_1 and t_2. This is a binary decision task. We will denote this subtask as (S1).
• Subtask 2: Define, for the elements of W, a measure of their degree of lexical semantic change between t_1 and t_2 and sort these elements accordingly. This is a ranking task. This subtask will be referred to as (S2).
We describe two different methods able to address both subtasks. As both methods require a preprocessing step based on transformers, let us start with this preliminary operation.
We feed these sentences to BERT to get the token-level representations. We use the BERT-base model with twelve transformer layers and derive the final embedding by concatenating the last four layers. If a single word gets split by the tokenizer, we take the average embedding of its sub-tokens. Thus, for each target word we extract the embeddings from all the sentences associated with it in both corpora, C_1 and C_2. For languages other than English, we used multilingual BERT models covering the respective languages. Since the task is completely unsupervised in nature, we used the pre-trained models, without fine-tuning, to generate the embeddings for all experiments.
Let us assume, for each w ∈ W, that S_j := {s_{i,j}}, i = 1, ..., n_{w,j}, are the sentences in corpus C_j where the target word w occurs, for each j = 1, 2. The transformer maps those sentences into corresponding vectors of a d-dimensional space. We perform a single transformation for both S_1 and S_2 simultaneously, and denote as x_{i,j} the vector associated with sentence s_{i,j}, for each i = 1, ..., n_{w,j} and j = 1, 2. We regard the relative distances between these vectors as proxy indicators of their semantic similarity, which helps to cluster similar senses together. While different distances, such as cosine, Manhattan and Euclidean, could be used to measure BERT embedding similarities, we adopt the Euclidean distance given the findings in (Hewitt and Manning, 2019; Inui et al., 2019; Reif et al., 2019), which demonstrate that the syntax tree distance between two words in the BERT embedding space corresponds to the square of the Euclidean distance. In the next two sections, in order to address subtasks (S1) and (S2), we focus on the vectors associated with the same target word w and use their relative Euclidean distances as indicators of possible semantic shifts from one corpus to the other.
Both methods are based on clustering algorithms applied to the vectors associated with a given target word, either within a single corpus or within the union of the two. We adopt the classical k-means clustering algorithm, which forms clusters by attempting to minimize the intra-cluster variance. The so-called silhouette method is used to select the optimal number of clusters k and the initialization (i.e., the position of the centroids before starting the algorithm) (Rousseeuw, 1987). Accordingly, given a value of k, we compute, for each point, the mean intra-cluster distance and the mean distance to the nearest other cluster. The difference between these two quantities, normalized by the maximum of the two and averaged over all points, is a fitness score to be maximized in order to select the optimal value of k. The same approach is used to determine the initial centroids: using the optimal value of k, we run the k-means algorithm for N iterations and keep the centroids achieving the best silhouette score.
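A minimal sketch of this silhouette-based model selection, using scikit-learn's KMeans and silhouette_score (the function name choose_k, its signature, and the demo data are ours, not from the original implementation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_max=10, random_state=0):
    """Return the number of clusters in [2, k_max] maximizing the
    mean silhouette score, i.e. the average of (b - a) / max(a, b),
    where a is the mean intra-cluster distance of a point and b its
    mean distance to the nearest other cluster."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy sanity check: three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
demo_X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
                    for loc in ((0.0, 0.0), (5.0, 5.0), (0.0, 5.0))])
best_k = choose_k(demo_X, k_max=6)
```

The same score can also compare different centroid initializations at a fixed k, as done in the paper.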

Method 1: Joint Clustering Vectors of Both Corpora
Let us focus on a particular target word w ∈ W. For the sake of readability, denote its vectors in corpus C_j simply as X_j := {x_{i,j}}, i = 1, ..., n_j, for each j = 1, 2. We cluster the whole set of vectors of the two corpora, X := X_1 ∪ X_2, and denote as X^1, ..., X^m the m clusters returned by the algorithm. Note that we use hard clustering, i.e., the union of the X^k is X and X^{k_1} ∩ X^{k_2} = ∅ for each k_1, k_2 = 1, ..., m with k_1 ≠ k_2. For each cluster X^k we count how many of its elements belong to X_1, say n_{1,k}, and to X_2, say n_{2,k}. We call a cluster impure if both n_{1,k} > 0 and n_{2,k} > 0. As we regard the clusters as equivalence classes for the abstract notion of sense, if all m clusters are impure then no new sense appeared in C_2 and no sense was lost from C_1 to C_2. If this is not the case, we might have a new sense in C_2, i.e., there is at least one k such that n_{1,k} = 0, or, vice versa, an old sense has been lost, i.e., there is at least one k such that n_{2,k} = 0. Following the guidelines of the SemEval shared task, we might set a lower bound n on the number of occurrences of a word in a cluster before deciding to regard it as a new sense; in that case, the above conditions on counts equal to zero are replaced by n_{j,k} < n. Overall, this procedure yields a sound algorithm to address (S1). We refer to it as M1S1.
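The impurity test above can be sketched as follows (the function name and signature are ours; labels may come from any hard clustering of the joint vector set):

```python
def sense_change(labels, corpus_ids, n_min=1):
    """M1S1-style decision sketch: given one cluster label and one corpus
    id (1 or 2) per word occurrence, flag a sense change if some cluster
    has fewer than n_min occurrences from one corpus but at least n_min
    from the other (n_min=1 recovers the plain zero-count condition)."""
    counts = {}  # cluster label -> [count in C1, count in C2]
    for lab, cid in zip(labels, corpus_ids):
        counts.setdefault(lab, [0, 0])[cid - 1] += 1
    return any((n1 < n_min <= n2) or (n2 < n_min <= n1)
               for n1, n2 in counts.values())
```

For example, if cluster 1 contains only C_1 occurrences, the word is flagged as having gained or lost a sense.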
Regarding (S2), after the clustering we define a random variable S, called the sense variable, whose m states are in one-to-one correspondence with the clusters. The variable describes how likely it is to find an occurrence of w with a given sense in a corpus. Accordingly, we use the counts {n_{j,k}}, k = 1, ..., m, to learn a probability mass function P_j(S) for each j = 1, 2. Following a Bayesian approach, based on a Laplace uniform prior with equivalent sample size σ > 0 (Gelman et al., 2013), we have:

P_j(s_k) := (n_{j,k} + σ/m) / (n_j + σ), (1)

where n_j := Σ_k n_{j,k}. In this probabilistic setup, the semantic shift of the target word w between the two corpora can therefore be described by the dissimilarity between the mass functions P_1(S) and P_2(S). We measure it by the Shannon-Jensen distance SJ, i.e., a symmetrization of the popular Kullback-Leibler divergence. The semantic shift of w thus corresponds to the distance δ := SJ(P_1, P_2), with SJ(P_1, P_2) := (1/2)[KL(P_1, P_2) + KL(P_2, P_1)] and KL(P_1, P_2) := Σ_{k=1}^{m} P_1(s_k) ln[P_1(s_k) / P_2(s_k)]. Note that, with the Bayesian smoothing in Equation (1), we cannot have zero probabilities and hence no degenerate values in the computation of the distance. The overall procedure yields an algorithm for subtask (S2), which corresponds to sorting the elements of W with respect to their value of δ. We refer to this procedure as M1S2.
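A small sketch of this shift score, assuming the Laplace-smoothed estimate P_j(s_k) = (n_{j,k} + σ/m)/(n_j + σ) for the sense distributions (the function name semantic_shift is ours):

```python
import math

def semantic_shift(counts1, counts2, sigma=1.0):
    """M1S2-style score sketch: smooth the per-corpus sense counts with a
    uniform Laplace prior of equivalent sample size sigma, then return the
    symmetrized KL divergence between the two distributions."""
    m = len(counts1)

    def smooth(counts):
        total = sum(counts)
        return [(c + sigma / m) / (total + sigma) for c in counts]

    p1, p2 = smooth(counts1), smooth(counts2)
    kl = lambda p, q: sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return 0.5 * (kl(p1, p2) + kl(p2, p1))
```

Thanks to the smoothing, clusters with zero counts in one corpus contribute a finite (rather than infinite) amount to the score.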

Method 2: Separate Clustering of the Two Corpora
In this section, while still focusing on a given target word w ∈ W, we consider a different approach based on the separate clustering of the two sets of vectors X_1 and X_2.
Let X_j be partitioned into m_j clusters X_j^1, ..., X_j^{m_j}, for each j = 1, 2; these denote the two sets of clusters. As in the previous method, we regard each cluster as a representative model of a sense. Yet, unlike the previous case, here we need to define a map between the clusters of the first corpus and those of the second. As discussed before, we adopt the Euclidean distance between vectors as a proxy indicator of semantic similarity. In order to cope with single numerical values, for the sake of simplicity, we represent each cluster by its center of mass. Let x̄_j^{k_j} denote the center of mass of X_j^{k_j}, for each k_j = 1, ..., m_j and j = 1, 2. If m_1 = m_2, i.e., the two corpora have the same number of clusters, the identification of the map between the clusters of the two corpora reduces to a minimum-weight matching in a complete bipartite graph, whose nodes are associated with the two sets of clusters and whose weights are the Euclidean distances between the centers of mass. If this is not the case and, for instance, m_1 > m_2, we add m_1 − m_2 dummy clusters to the second corpus and set to zero the weights of all arcs connecting these elements. We proceed similarly if m_2 > m_1.
The optimal matching minimizing the sum of the weights can be computed in cubic time with the classical Hungarian algorithm (Kuhn, 1955; Jonker and Volgenant, 1987), and the result is a one-to-one correspondence between the clusters of the two corpora, whether proper or dummy. As a dummy cluster in one corpus has zero distance from all the clusters of the other corpus, the matching returned by the Hungarian algorithm properly minimizes the distance between the proper clusters. Two proper clusters of the two corpora matched by the algorithm are interpreted as representatives of the same sense. A proper cluster matched to a dummy cluster is instead regarded as a new sense that appeared in the second corpus only, or an old sense that occurred in the first corpus only.
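This matching step can be sketched with scipy's Hungarian-algorithm implementation, linear_sum_assignment (the function name match_clusters and the zero-weight dummy padding convention below follow the paper's description; the signature is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(centers1, centers2):
    """M2 matching sketch: minimum-weight bipartite matching between the
    per-corpus cluster centers of mass. The smaller side is padded with
    dummy clusters whose arcs all have zero weight; a matched index
    >= m_j therefore means 'matched to a dummy', i.e. a gained/lost sense."""
    m1, m2 = len(centers1), len(centers2)
    m = max(m1, m2)
    cost = np.zeros((m, m))  # dummy rows/columns keep their zero weights
    for i in range(m1):
        for j in range(m2):
            cost[i, j] = np.linalg.norm(np.asarray(centers1[i])
                                        - np.asarray(centers2[j]))
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm, O(m^3)
    return list(zip(rows.tolist(), cols.tolist()))
```

For instance, a far-away cluster with no plausible counterpart in the other corpus ends up matched to a dummy index.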
After the matching, we define a single clustering with m := max{m_1, m_2} clusters and proceed exactly as in the previous section. In practice, the vectors of two proper clusters matched by the Hungarian algorithm are assigned to a single, impure, cluster, while those linked to dummy clusters produce pure clusters. We term M2S1 and M2S2 the two algorithms corresponding to the approach discussed in this section for the two subtasks. The next section describes the experimental analysis and evaluation results.

Method 3: An Alternative Approach for Subtask 2 (S2)
As an alternative to the previously explained procedure for (S2), based on the Bayesian approach and the Shannon-Jensen distance, we consider another approach, exploiting only the number of word occurrences per cluster and corpus. More precisely, assuming we have K clusters in total and have already computed n_{1,k} and n_{2,k} for each k ∈ {1, ..., K} (regardless of whether the clusters come from the single clustering of M1 or from the optimal cluster matching of M2), we define the coefficient of semantic change of the word (its ranking score in (S2) terminology) as:

(1 / (2 S_1 S_2)) Σ_{k=1}^{K} |S_2 · n_{1,k} − S_1 · n_{2,k}|,

where S_1 = Σ_{k=1}^{K} n_{1,k} and S_2 = Σ_{k=1}^{K} n_{2,k}. Let us assume that word w has p occurrences in each corpus. It is easy to see that, in the case K = 2 with a clear cut between corpora (e.g., all occurrences in cluster 1 belong to C_2 and all occurrences in cluster 2 belong to C_1, i.e., n_{2,1} = n_{1,2} = p and n_{1,1} = n_{2,2} = 0), the coefficient equals 1, which indicates a complete change of sense. Likewise, if the distribution of occurrences is uniform (n_{j,k} = p/2 for all j, k), it yields 0, meaning no sense change. We denote these two new procedures for (S2), applied to M1 and M2, as NM1 and NM2, respectively.
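The coefficient is a few lines of code (a sketch; the 1/(2·S_1·S_2) normalization is the one implied by the boundary cases above, and the function name is ours):

```python
def change_coefficient(counts1, counts2):
    """Method 3 sketch: count-based coefficient of semantic change over
    K matched clusters, normalized so that a clear-cut split between the
    corpora yields 1 and a uniform split yields 0."""
    s1, s2 = sum(counts1), sum(counts2)
    raw = sum(abs(s2 * n1 - s1 * n2) for n1, n2 in zip(counts1, counts2))
    return raw / (2 * s1 * s2)
```

Up to the factor 1/2, this is the L1 distance between the two normalized sense distributions, i.e. their total variation distance.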

Experimental Analysis
The experimental analysis is performed according to the rules posed by the SemEval-2020 challenge organizers, using the provided corpora (two per language, one per time period) and three baselines. Corpora are provided in four languages: English (Alatrash et al., 2020), Latin (McGillivray and Kilgarriff, 2013), German (Textarchiv, 2018) and Swedish (Adesam et al., 2019). Table 1 provides brief statistics for the given corpora, stating the number of target words (NTW) and the total and average number of sentences containing target words (NSTW) per language. The three baselines are: normalized frequency difference (FD), count vectors with column intersection and cosine distance (CNT+CI+CD), and a random baseline always predicting the majority class (RND/MC); for details see (Schlechtweg et al., 2019).
For evaluation (as instructed by the SemEval guidelines), accuracy is used for (S1), while the Spearman coefficient, taking values between -1 (perfect negative correlation) and 1 (perfect positive correlation), is used for (S2). More details can be found in the system description paper.
It is worth mentioning that the upper bound on the number of clusters for k-means selected via the silhouette score was set to 10.
As explained before, for (S1) the idea is to compare the numbers of elements of each cluster X^k coming from the two corpora, n_{1,k} and n_{2,k}, and to claim a change in senses if ∃k : n_{1,k} = 0 ∨ n_{2,k} = 0. This was indeed the procedure applied to the Latin corpora. For the other languages (with larger corpora), following the guidelines of the SemEval shared task, additional restrictions in terms of a lower bound n and an upper bound n̄ were set, with the following purpose: a word is considered to have gained a new sense if ∃k : n_{1,k} ≤ n̄ ∧ n_{2,k} ≥ n (and vice versa for losing a sense). The suggested values for these bounds are n = 5 and n̄ = 2.
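This bounded decision rule can be sketched as follows (the function name and keyword arguments are ours; the default bounds are the suggested shared-task values):

```python
def gained_or_lost(counts1, counts2, lower=5, upper=2):
    """Bounded (S1) rule sketch: a word gains a sense if some matched
    cluster has at most `upper` occurrences in C1 and at least `lower`
    in C2; it loses a sense in the symmetric case."""
    gained = any(n1 <= upper and n2 >= lower
                 for n1, n2 in zip(counts1, counts2))
    lost = any(n2 <= upper and n1 >= lower
               for n1, n2 in zip(counts1, counts2))
    return gained or lost
```

With the per-cluster counts reported later for the English word tip (n_{1,2} = 1 and n_{2,2} = 30), this rule fires as expected.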
The code is implemented in Python using the Scikit-learn (Buitinck et al., 2013) and Transformers (Wolf et al., 2019) libraries. The experimental results on the corpora for the two subtasks (S1) and (S2), obtained with methods M1 and M2, are reported in Table 2. Results are shown for the four target languages and overall, and compared with the provided SemEval baselines. The best results per language and subtask are underlined; the overall best results per subtask are denoted in boldface. As can be seen, except for Latin, the proposed methods outperform all baselines on (S1), with procedure M1S1 being the best for German and Swedish and M2S1 for English. On the other hand, for (S2) the results are quite corpus/language dependent: M1S2 scores best for English and M2S2 for Swedish, while baseline 2 (CNT+CI+CD) wins over all the others for Latin and German. Overall, Method 2 outperforms its competitors on both subtasks. The performance with the contextualized embeddings is, in many cases, only comparable with the baseline approaches. This could be due to the fact that pre-trained BERT embeddings may not be completely suitable for producing meaningful sentence vectors for clustering (Reimers and Gurevych, 2019). These factors have to be investigated in the future.

[Table 2: results per language and overall for Method 1, Method 2, and the SemEval baselines (FD, CNT+CI+CD, MC), reporting (S1) and (S2) scores for each.]

Figure 1a shows an example of the application of method M1 to the English target word tip, based on 2D t-SNE (Maaten and Hinton, 2008) projections. The method produces two clusters, each denoted with a different color. As can be seen, one of the clusters (orange, k = 1) is remarkably larger than the other (blue, k = 2). It is also quite impure, containing word occurrences from both corpora (more precisely, n_{1,1} = 112 and n_{2,1} = 211), while the other cluster contains only 31 instances, distributed as n_{1,2} = 1 and n_{2,2} = 30. Given that n_{1,2} = 1 ≤ 2 = n̄ and n_{2,2} = 30 ≥ 5 = n, method M1S1 correctly detects the sense change for the word tip. Figure 1b shows an example of the application of method M2 to the English target word lane (2D t-SNE projections). The clustering algorithm produces three clusters for each corpus, each denoted with a different color, and the matching algorithm detects the correspondence between clusters by minimizing the distances between the centers of mass, depicted as squares in the figure. Matching clusters of the two corpora are depicted with the same color.
The results of the alternative method M3 for subtask (S2), compared with M1, M2 and two baselines (FD and CNT+CI+CD), are provided in Table 3. We can see that NM1 improves the results for English and German, and overall.

Related Work
The task of identifying words whose meaning has changed over time is well known, and the related literature is rich, with many recent advancements. Nevertheless, quite a few issues remain to be resolved, primarily regarding the methodology and the respective ground truth (missing semantic change annotations).
As for the latter, the first step is to decide whether to aim for a binary response (the equivalent of SemEval-2020 subtask 1) or to provide graded ratings of a sense change (the equivalent of SemEval-2020 subtask 2). Given that in both cases, but particularly with graded ratings, inter-annotator agreement rates vary greatly, as evidenced in (Erk et al., 2009), establishing a standard test set is extremely difficult. In (Schlechtweg et al., 2018), a unifying evaluation framework for unsupervised lexical semantic change detection was proposed, based on changes in the relatedness of word use pairs in each time period.
Regarding the former, many different approaches to unsupervised lexical semantic change detection have been suggested. A detailed survey of studies can be found in (Kutuzov et al., 2018; Tahmasebi et al., 2018). Most notably, several works (Baroni et al., 2014; Kim et al., 2014; Hamilton et al., 2016) showed the benefits of using dense word representations for semantic shift detection. Furthermore, (Kulkarni et al., 2015) showcased that these outperform frequency-based methods. However, unlike our work, none of these works exploits clustering. Closer to our approach, that is, considering clusters as representative semantic areas, are the works of (Mitra et al., 2014) and (Dubossarsky et al., 2015). The main differences are, however, that in (Mitra et al., 2014) clustering is performed on the ego-network of each word, where the network is constructed from word co-occurrences, while we cluster the word embeddings themselves. Additionally, in contrast to (Dubossarsky et al., 2015), where the authors consider incremental learning of word embeddings in yearly chunks and vary the number of clusters from 500 to 5000, we use silhouette scores to determine the optimal number of clusters for each target word.

Conclusion and Future Work
A word can have a different meaning (sense) in different contexts and/or different time periods. Despite being quite extensively studied, the problem of identifying words that have changed their meaning over time, particularly in an unsupervised way, still challenges researchers.
In this work, we propose two approaches, both combining contextualized word embeddings (obtained with BERT) and clustering, and differing only in the way the clustering is performed. Considering the obtained clusters as proxies for word meanings allows us to quantify the level of change for each target word in four target languages. The obtained results, especially overall, across all target languages, where we outperform all provided baselines, demonstrate the usefulness of the suggested approach.
As potential directions for future work, we plan to investigate various strategies, including different clustering methods and a time-wise comparison of the nearest neighbours of target words, in an attempt to identify actual word senses more accurately. Additionally, we would like to further scrutinize how the linguistic particularities of the different corpora might have contributed to the variability of the results.