GM-CTSC at SemEval-2020 Task 1: Gaussian Mixtures Cross Temporal Similarity Clustering

This paper describes the system proposed by the Random team for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We focus our approach on the detection problem. Given the semantics of words captured by temporal word embeddings in different time periods, we investigate the use of unsupervised methods to detect when the target word has gained or lost senses. To this end, we define a new algorithm based on Gaussian Mixture Models to cluster the target similarities computed over the two periods. We compare the proposed approach with a number of similarity-based thresholds. We found that, although the performance of the detection methods varies across the word embedding algorithms, the combination of Gaussian Mixture with Temporal Referencing resulted in our best system.


Introduction
The recent development of word embeddings and their increasing capability to capture lexical semantics have inspired the application of these methods to new tasks and introduced new challenges. The diachronic analysis of language is one of the linguistic tasks that has benefited from these methods, i.e. from the capability to build semantic representations of words by skimming through large corpora spanning multiple time periods. SemEval 2020 Task 1 addresses the current lack of a systematic approach for the evaluation of automatic methods for diachronic analysis by proposing a common evaluation framework that comprises two subtasks and covers corpora written in four different languages, namely German (Zeitung, 2018; Textarchiv, 2017), English (Alatrash et al., 2020), Latin (McGillivray and Kilgarriff, 2013), and Swedish (Borin et al., 2012). Given two corpora $C_1$ and $C_2$ for two periods $t_1$ and $t_2$, Subtask 1 requires participants to classify a set of target words into two categories: words that have lost or gained senses from $t_1$ to $t_2$ and words that have not. Subtask 2 requires participants to rank the target words according to their degree of lexical semantic change between the two periods. We tackle the problem of automatically detecting lexical semantic changes with approaches that rely on temporal word embeddings. These approaches create a word vector representation for each time period by exploiting a shared semantic space. Similarity measures can then be used to capture the extent of a word's semantic change between two periods. Some temporal word embedding techniques adopt a two-step approach, where they first learn separate word embeddings for each time period and then align the word vectors across multiple time periods (Hamilton et al., 2016). Other dynamic approaches incorporate the alignment directly into the learning stage via the optimisation function (Tahmasebi et al., 2018).
Dynamic word embeddings can be further categorised according to the constraint imposed on the alignment. Explicit alignment adopts a conservative approach to the semantic drift that a word can undergo by limiting the distance between the word vectors belonging to the two temporal spaces. In implicit alignment, there is no need for an explicit constraint, since the alignment is performed automatically by sharing the same word context vectors across all the time periods.
In this work, we focus on dynamic word embeddings by exploring methods based on both explicit alignment, such as Dynamic Word2Vec (Yao et al., 2018), and implicit alignment, namely Temporal Random Indexing (Basile et al., 2015) and Temporal Referencing (Dubossarsky et al., 2019). We analyse the use of different similarity measures to determine the extent of a word's semantic change and compare the cosine similarity with the Pearson correlation and the neighborhood similarity (Shoemark et al., 2019). While these similarity measures can be directly employed to generate a ranked list of words for Subtask 2, their adoption in Subtask 1 requires further manipulation. We introduce a new method to classify changing vs. stable words by clustering the target similarity distributions via Gaussian Mixture Models. We describe the embedding models and the clustering algorithm in Section 2, while Section 3 provides details about the hyper-parameter selection. Section 4 reports the results of the task evaluation, followed by some concluding remarks in Section 5.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

System description
We model the problem of automatic detection of semantic change by exploiting temporal word embeddings $E_i : w \to \mathbb{R}^d$ that project each word $w$ in the vocabulary $V$ into a $d$-dimensional semantic space. Given two different time periods $t_1$ and $t_2$, we create two embeddings $E_1$ and $E_2$. We investigate several models to compute temporal word embeddings:
Dynamic Word2Vec (DW2V) (Yao et al., 2018) simultaneously learns time-aware embeddings by aligning and reducing the dimensionality of time-binned Positive Point-wise Mutual Information matrices.
Temporal Random Indexing (TRI) (Basile et al., 2015) implicitly aligns co-occurrence matrices by using the same random projection for all the temporal bins.
Collocations extracts, for each word and each time period, the set of relevant collocations through the Dice score. As the similarity function, we measure the cosine similarity between the sets of collocations belonging to the two time periods. More details are reported in Basile et al. (2019).
Temporal Referencing (TR) (Dubossarsky et al., 2019), used only in the post-evaluation, is a modified version of Word2Vec Skip-gram that adds a temporal reference to the target vectors while keeping the context vectors unchanged.
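The temporal referencing idea described above can be illustrated with a small sketch that builds Skip-gram training pairs in which only the centre (target) word carries a period tag, while context words keep their plain form. This is not the original implementation: the function name `temporal_pairs`, the `word_period` tag format and the pair-generation details are hypothetical choices made for illustration.

```python
def temporal_pairs(tokens, targets, period, window=5):
    """Build (centre, context) pairs where target words get a period tag.

    Hypothetical helper: only the centre word is tagged, so the context
    vocabulary stays shared across periods.
    """
    pairs = []
    for i, token in enumerate(tokens):
        # Tag the centre word only if it is one of the target words.
        centre = f"{token}_{period}" if token in targets else token
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                # Context words are never tagged.
                pairs.append((centre, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = ["the", "plane", "flies"]
    print(temporal_pairs(sentence, {"plane"}, "t1", window=1))
```

Because the contexts are shared, the tagged vectors for the two periods live in one space and can be compared directly, which is the property the implicit alignment relies on.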
A similarity measure between vectors in the two temporal spaces is adopted to compute the extent of the semantic drift of the target words. We explored several similarity measures:
Cosine similarity (CS) is the cosine of the angle between two vectors.
Pearson correlation (PC) measures the linear correlation between two variables; for centred vectors (with zero mean), it is equivalent to the cosine similarity.
Neighborhood similarity (NS) computes the two $k$-neighbour sets $\mathrm{nbrs}_k(E_1(w))$ and $\mathrm{nbrs}_k(E_2(w))$ and their union $U = \mathrm{nbrs}_k(E_1(w)) \cup \mathrm{nbrs}_k(E_2(w))$. Two second-order vectors $u^j$, one for each word representation $v^j = E_j(w)$ with $j \in \{1, 2\}$, are then created. The $i$-th component of $u^j$ is the cosine similarity between the vector $v^j$ and the $i$-th element of $U$: $u^j_i = \cos(v^j, U(i))$. The neighborhood similarity is the cosine similarity between the two second-order vectors. In all the experiments we set $k = 25$.
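The three measures can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code used for the submissions: each temporal space is assumed to be a dictionary from word to vector with a shared vocabulary, and the vector of a neighbour $U(i)$ is assumed to be taken from the same temporal space as $v^j$. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def neighbours(word, space, k):
    """The k words closest to `word` within one temporal space."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return others[:k]


def neighbourhood_similarity(word, space1, space2, k=25):
    """Cosine between the second-order vectors built on the neighbour union."""
    union = sorted(set(neighbours(word, space1, k)) | set(neighbours(word, space2, k)))
    # Assumption: each neighbour vector comes from the same space as the word vector.
    u1 = [cosine(space1[word], space1[n]) for n in union]
    u2 = [cosine(space2[word], space2[n]) for n in union]
    return cosine(u1, u2)
```

The Pearson function makes the equivalence stated above explicit: it simply centres both vectors and reuses the cosine.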

Subtask 2
In Subtask 2, we use one of the three similarity measures (CS, PC, NS) to compute the set of target similarities $S = \{\mathrm{sim}(E_1(w), E_2(w)) \mid w \in T\}$, where $T$ is the set of target words. Then, we rank the target words according to the distance, computed as $1 - |\mathrm{sim}(E_1(w), E_2(w))|$.
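As a small sketch of this ranking step, assuming the similarities have already been computed and stored in a dictionary from target word to similarity (the function name and the example words are hypothetical):

```python
def rank_by_change(similarities):
    """Rank target words by the distance 1 - |sim|, most changed first."""
    distances = {word: 1.0 - abs(sim) for word, sim in similarities.items()}
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical similarities for three target words.
    for word, distance in rank_by_change({"a": 0.9, "b": 0.2, "c": -0.5}):
        print(word, round(distance, 3))
```

Note that the absolute value treats a strongly negative similarity like a strongly positive one, so only similarities near zero yield the largest distances.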

Subtask 1: Gaussian Mixture Clustering
Subtask 1 requires a further step: given $S$, the set of target similarities, we need to predict the target labels. The aim is to assign one of two classes, 0 (stable) or 1 (change), to each target word of a given language. We assume that low similarities suggest changing words and high similarities indicate stable words. Gaussian Mixture Models (GMMs) allow us to build probabilistic models representing the Gaussian distributions of stable and changing targets. We use GMMs to model the density of the distribution of the target similarities as a weighted sum of Gaussian densities (Huang et al., 2017):
$$p(S) = \sum_{m=1}^{M} \pi_m \, \phi(S \mid \mu_m, \Sigma_m) \qquad (1)$$
where $M$ is the number of mixture components, $\phi(S \mid \mu_m, \Sigma_m)$ is the Gaussian density with mean vector $\mu_m$ and covariance matrix $\Sigma_m$, and $\pi_m$ is the prior probability of the $m$-th component. Additional constraints can be applied to the covariance matrix in Eq. 1. In our experiments, we allow each component to have its own covariance matrix. For our purpose, we speculate that the distribution of target similarities is a mixture of two densities, representing the stable and the changing words. Consequently, we fix the number of mixture components in the GMMs to two. We initially assign a label (stable/changing) to each density at random. Let $\mu_0$ and $\mu_1$ be the means of the two Gaussians associated with the "stable" and "changing" labels, respectively. If $\mu_0 < \mu_1$ (i.e. the similarity mean of the distribution labelled as "stable" is lower than the mean of the distribution labelled as "changing"), we invert the labels. Alg. 1 can be used to properly label each word of the target vocabulary.
Algorithm 1: Assign labels
    input: $S$
    output: labels
    $\mathcal{N}(\mu_0, \sigma_0)$, $\mathcal{N}(\mu_1, \sigma_1)$, labels ← GaussianMixtures($S$);
    if $\mu_0 < \mu_1$ then
        labels ← 1 − labels;
    end
In order to set the best parameters for each language and model, we rely on the GMMs log-likelihood, which is generally used for estimating cluster quality:
$$\ell(\theta \mid S) = \sum_{s \in S} \log \sum_{m=1}^{M} \pi_m \, \phi(s \mid \mu_m, \Sigma_m)$$
where $\theta$ are the parameters of the GMM. For each language, we select the best model configuration to submit to the challenge using the GMMs log-likelihood $\ell(\theta \mid S)$. This means that the hyper-parameters for the different languages are tuned using the GMMs log-likelihood. Somewhat improperly, we also use this criterion to choose among different models (i.e. different sets of similarities $S$), since we do not have a validation set for tuning the parameters. We will investigate this limitation in future work. The selected models and hyper-parameters are reported in Tab. 1. In particular, we use cosine similarity, Pearson correlation and neighborhood similarity for computing the target similarities in the Overall_CS, Overall_PC and Overall_NS runs, respectively. In the DW2V and TRI runs we always use cosine similarity.
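The labelling procedure of Alg. 1 can be sketched as follows. To keep the example self-contained, it swaps in a minimal hand-written EM for a two-component, one-dimensional GMM instead of a library implementation; the deterministic initialisation (means at the minimum and maximum similarity), the variance floor and all function names are assumptions made for illustration and are not taken from the submitted system.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(values, weights, means, variances):
    """E-step: posterior probability of each component for each value."""
    resp = []
    for x in values:
        dens = [weights[m] * gaussian_pdf(x, means[m], variances[m]) for m in range(2)]
        total = sum(dens) or 1e-300  # guard against numerical underflow
        resp.append([d / total for d in dens])
    return resp


def fit_two_gaussians(values, n_iter=200, min_var=1e-6):
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    n = len(values)
    centre = sum(values) / n
    spread = max(sum((x - centre) ** 2 for x in values) / n, min_var)
    weights, means, variances = [0.5, 0.5], [min(values), max(values)], [spread, spread]
    for _ in range(n_iter):
        resp = responsibilities(values, weights, means, variances)
        for m in range(2):  # M-step: re-estimate each component
            mass = sum(r[m] for r in resp)
            if mass < 1e-12:
                continue  # leave an empty component untouched
            weights[m] = mass / n
            means[m] = sum(r[m] * x for r, x in zip(resp, values)) / mass
            var = sum(r[m] * (x - means[m]) ** 2 for r, x in zip(resp, values)) / mass
            variances[m] = max(var, min_var)
    return weights, means, variances


def log_likelihood(values, weights, means, variances):
    """Log-likelihood of the values under the fitted mixture."""
    total = 0.0
    for x in values:
        density = sum(weights[m] * gaussian_pdf(x, means[m], variances[m]) for m in range(2))
        total += math.log(max(density, 1e-300))
    return total


def assign_labels(similarities):
    """Alg. 1: label 0 = stable, 1 = changing; flip if the 'stable' mean is lower."""
    weights, means, variances = fit_two_gaussians(similarities)
    resp = responsibilities(similarities, weights, means, variances)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    if means[0] < means[1]:
        labels = [1 - label for label in labels]
    return labels
```

The `log_likelihood` helper corresponds to the selection criterion discussed above: under this sketch, one would fit the mixture on each candidate similarity set and keep the configuration with the highest value.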

Experimental Setup
In all the runs, we do not pre-process the data, and we use a context window of size 5 while analyzing sentences. We use the original implementations of the TR, TRI and DW2V models. For runs involving TRI, we experimented with vector sizes from 200 to 1,000. Moreover, we investigated (1) the initialization of the count matrix at time $j$ with the matrix at time $j - 1$, (2) the contribution of positive-only projections, and (3) the application of PPMI weights, as explained in QasemiZadeh and Kallmeyer (2016). For DW2V, we use the parameter setting proposed in Yao et al. (2018): we set $\lambda = 10$, $\tau = 50$, $\gamma = 100$, $\rho = 50$ and experimented with a number of iterations from one to five. As vocabulary, we kept the 50,000 most frequent tokens for both TRI and DW2V. In the TR runs, we set the vector size to 100 and experimented with eight iterations for English and Latin, and four for German and Swedish. We use 20 negative samples and keep only the tokens that occur at least 10 times. All the other parameters used for configuring the models are reported in Tab. 1.

[Table 1: selected models and hyper-parameters for each run. Columns: Run, Configuration, English, German, Latin, Swedish.]

Results
SemEval 2020 Task 1 provides three baselines: Freq. Baseline, which uses the absolute difference of the normalized frequency in the two corpora as a measure of change; Count Baseline, which implements the model described in ; and Maj. Baseline, which always predicts the majority class. Tab. 2 reports the main results obtained by the different models. It shows the results of the official submissions to the challenge and the results obtained by the TR approach during the post-evaluation phase. The results for Subtask 1 are reported in terms of accuracy, while Spearman's rank-order correlation coefficient is used for Subtask 2.
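To make the frequency baseline and the Subtask 1 metric concrete, the following sketch computes the absolute difference of normalized frequencies and the accuracy of a set of predictions. It assumes that "normalized frequency" means the count of the word divided by the total number of tokens in the corpus; the exact normalisation used by the organisers may differ, and the function names are hypothetical.

```python
from collections import Counter


def frequency_change(tokens1, tokens2, word):
    """Absolute difference of the word's relative frequency in the two corpora."""
    freq1 = Counter(tokens1)[word] / len(tokens1)
    freq2 = Counter(tokens2)[word] / len(tokens2)
    return abs(freq1 - freq2)


def accuracy(predicted, gold):
    """Fraction of target words whose predicted label matches the gold label."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


if __name__ == "__main__":
    corpus1 = ["a", "b", "a", "c"]
    corpus2 = ["a", "b", "b", "b"]
    print(frequency_change(corpus1, corpus2, "b"))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```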
Considering the results of the evaluation phase, the models show inconsistent behavior. TRI showed the best performance over all the languages for both subtasks, although in Subtask 1 it does not outperform Count Baseline and Maj. Baseline. Focusing on Subtask 1, if we consider each language in isolation, we see that DW2V gives the best results for English, while Overall_PC is our best system for German, although it does not outperform Count Baseline. Collocation is our best system for Latin (although outperformed by Freq. Baseline), while TRI is our best system for Swedish. In Subtask 2, the best English score was obtained by Overall_NS, and Overall_CS (Collocation) performed best for German. For Latin and Swedish, TRI provided the best results and, interestingly, it is one of the few systems that did not produce a negative correlation, although it is outperformed by Count Baseline for Latin.
At the end of the challenge, when the labelled test set was released, we performed more experiments, reported in the post-evaluation row. In this phase, we ran an additional system, TR, which outperformed all the previously reported approaches, including all baselines. The only exception is Latin, for which Freq. Baseline achieves 0.650 accuracy in Subtask 1, compared with 0.525 for TR. TR and TRI are both based on implicit alignment, but the former is a prediction-based model while the latter is a count-based one. Moreover, TR creates a temporal word embedding only for the target words rather than for the whole vocabulary. Consequently, this results in better word embeddings for all the words in the vocabulary that do not have a temporal reference, because they are represented by using all their occurrences in $C_1$ and $C_2$. We suppose that these differences allow TR to achieve better results than the other models.
Tab. 3 reports the best results for each language among all participants in Task 1. In Subtask 1, UWB obtains the best result for German, tied with Life-Language and RPI-Trust, and the best average result over all languages. Our official submission TRI gives the best result for Swedish, whereas Jiaxin & Jian ranks first for Latin and NLPCR for English. In Subtask 2, NLPCR and UWB obtain the best results for English and German, respectively, confirming the results obtained in Subtask 1. For Latin, Jiaxin & Jian also confirms the results obtained in Subtask 1 and is outperformed only by RPI-Trust, while UWB obtains the best result for Swedish. In general, each system achieved its best performance in one language while performing differently on the others.
During the post-evaluation, we also decided to investigate the role of GMMs for class labeling (Sec. 2). We compared GMMs with the semi-manual thresholds $\mu_S$, $\mu_S - \sigma_S$, $\mu_S + \sigma_S$ and with Winsorizing (Kokic and Bell, 1994), where $\mu_S$ and $\sigma_S$ are the mean and the standard deviation of the similarity set $S$ computed on the data provided for Subtask 1. Figure 1 reports the accuracy scores obtained by the five methods for the TRI, Collocation, DW2V and TR approaches. The scores of the GMMs strategy are close to those obtained by $\mu_S$ for TRI and Collocation. While GMMs outperforms $\mu_S + \sigma_S$ in every run, $\mu_S - \sigma_S$ seems to work better than GMMs except for TR. Winsorizing works better than GMMs for TRI and Collocation, whereas GMMs outperforms Winsorizing for DW2V and TR. These results are not clear enough to advocate for a specific threshold. Consequently, further analysis will be part of future work, in order to understand which threshold could best be included in the GMMs process.
Figure 1: Accuracy scores in Subtask 1 using different class labeling strategies: GMMs, $\mu_S$, $\mu_S - \sigma_S$, $\mu_S + \sigma_S$ and Winsorizing using mean and standard deviation.
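The mean-based thresholds compared with GMMs above can be sketched as follows. Two assumptions are made here and are not stated in the text: a word is labelled as changing when its similarity falls below the threshold (in line with the assumption that low similarities suggest change), and $\sigma_S$ is the population standard deviation. Winsorizing is left out because its exact configuration is not described; the function names are hypothetical.

```python
import statistics


def candidate_thresholds(similarities):
    """The three mean-based thresholds computed on the similarity set S."""
    mu = statistics.mean(similarities)
    sigma = statistics.pstdev(similarities)  # assumption: population std. dev.
    return {"mean": mu, "mean-std": mu - sigma, "mean+std": mu + sigma}


def threshold_labels(similarities, threshold):
    """Label 1 (changing) below the threshold, 0 (stable) otherwise."""
    return [1 if sim < threshold else 0 for sim in similarities]


if __name__ == "__main__":
    sims = [0.2, 0.4, 0.6, 0.8]  # hypothetical target similarities
    for name, value in candidate_thresholds(sims).items():
        print(name, threshold_labels(sims, value))
```

A lower threshold such as $\mu_S - \sigma_S$ labels fewer words as changing, while $\mu_S + \sigma_S$ labels more, which is one way to read the differences in accuracy across the strategies.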