IMS at SemEval-2020 Task 1: How low can you go? Dimensionality in Lexical Semantic Change Detection

We present the results of our system for SemEval-2020 Task 1 that exploits a commonly used lexical semantic change detection model based on Skip-Gram with Negative Sampling. Our system focuses on Vector Initialization (VI) alignment, compares VI to the currently top-ranking models for Subtask 2 and demonstrates that these can be outperformed if we optimize VI dimensionality. We demonstrate that differences in performance can largely be attributed to model-specific sources of noise, and we reveal a strong relationship between dimensionality and frequency-induced noise in VI alignment. Our results suggest that lexical semantic change models integrating vector space alignment should pay more attention to the role of the dimensionality parameter.


Introduction
Lexical Semantic Change (LSC) Detection has drawn increasing attention in recent years (Tahmasebi et al., 2018; Kutuzov et al., 2018). SemEval-2020 Task 1 provides a multi-lingual evaluation framework to compare the variety of proposed model architectures. An important component of high-performance LSC detection models is an alignment method that makes semantic vector spaces comparable across time. In this paper we focus on a particular alignment method for type embeddings, Vector Initialization (VI), and how its performance interacts with vector dimensionality. We compare VI to two further state-of-the-art alignment methods, Orthogonal Procrustes (OP) and Word Injection (WI), which have shown high performance in previous studies (Hamilton et al., 2016b; Dubossarsky et al., 2019) and are also used in the top-ranking systems for Subtask 2. A systematic comparison of performance across dimensionalities d reveals that the optimal d of the models on the SemEval test data is lower than standard choices, and that VI's performance strongly depends on d, showing large drops for high dimensionalities. We demonstrate that this effect is correlated with the amount of frequency noise picked up by VI, i.e., the degree to which cosine distances between vectors reflect frequency differences between words rather than semantic differences. If properly tuned with respect to dimensionality and noise, VI outperforms OP and WI as an alignment method.

Related Work
The semantic representations we test fall into the large body of work on distributional semantic vector space models (Turney and Pantel, 2010) and represent specific instances of type-based word embeddings (Mikolov et al., 2013a). The need for vector space alignment in LSC detection is shared with bilingual lexicon induction (Ruder et al., 2019) and term extraction (Hätty et al., 2020) where corpus-specific semantic representations need to be mapped to common coordinate axes.
Alignment techniques introduce varying levels of noise (Dubossarsky et al., 2019), and the noise level (the signal-to-noise ratio) determines the optimal dimensionality of word embeddings (Yin and Shen, 2018). We regard any information contained within the semantic representation that captures anything but semantic relations between words as noise (e.g. word frequency). Sources of noise include the corpora, the representation method and the alignment technique. Consequently, a specific semantic representation learning algorithm (such as Skip-Gram with Negative Sampling) may have a different optimal dimensionality depending on the alignment technique it relies on. So far, research on LSC detection has not paid much attention to this relationship between dimensionality and noise: models with different susceptibilities to noise have typically been tested without varying the dimensionality (Hamilton et al., 2016b; Dubossarsky et al., 2019; Shoemark et al., 2019).

System overview
Most models in LSC detection combine three sub-systems: (i) creating semantic word representations, (ii) aligning them across corpora, and (iii) measuring differences between the aligned representations. Semantic representations can either be token-based, keeping one representation (e.g. a vector) per word use (e.g. Hu et al., 2019), or type-based, collapsing information from different uses into one representation (e.g. Hamilton et al., 2016b). The alignment step is needed mostly for vector space models, which might otherwise introduce arbitrary orthogonal transformations to the vector spaces they produce (Hamilton et al., 2016b).
Our system focuses on the type-based Skip-Gram with Negative Sampling (SGNS) with Vector Initialization alignment (VI) and Cosine Distance (CD). We chose this method due to its surprisingly good performance with d = 5 in a student shared-task project (Ahmad et al., 2020).

Semantic Representation
SGNS is a shallow neural network trained on pairs of word co-occurrences extracted from a corpus with a symmetric window. It represents each word w and each context c as a d-dimensional vector by solving

arg max_θ Σ_{(w,c)∈D} log σ(v_c · v_w) + Σ_{(w,c)∈D'} log σ(−v_c · v_w),

where σ(x) = 1/(1 + e^{−x}), D is the set of all observed word-context pairs and D' is the set of randomly generated negative samples (Mikolov et al., 2013a; Mikolov et al., 2013b; Goldberg and Levy, 2014). The optimized parameters θ are v_{w_i} and v_{c_i} for i ∈ {1, ..., d}. D' is obtained by drawing k contexts from the empirical unigram distribution P(c) = #(c)/|D| for each observation of (w, c), cf. Levy et al. (2015). After training, each word w is represented by its word vector v_w. To keep our results comparable to previous research (Hamilton et al., 2016b), we chose common settings for most of the hyper-parameters: a symmetric context window of size 10, an initial learning rate α of 0.025, k = 5 negative samples and no sub-sampling. Depending on corpus size we trained the model for either e = 5 (German, Swedish) or e = 30 epochs (English, Latin). As we focus on the effect of dimensionality, each experiment was performed for each d ∈ {5, 10, 25, 50, 80, 150, 200, 250, 300, 350, 500, 750, 1000}. Prior to the shared-task application we validated all models with these hyper-parameters on the German DURel dataset (Schlechtweg et al., 2018).

Alignment
Vector Initialization. In VI we first train the SGNS model on one corpus and then use the resulting vectors to initialize the vectors for training on the second corpus (Kim et al., 2014). The motivation for this procedure is that if a word is used in similar contexts in both corpora, the second training step will not change the initial word vector much, while more dissimilar contexts will lead to a greater change of the vector. SGNS represents each word by two vectors, a word vector and a context vector. The former is modified when a word occurs as target w in a target-context pair (w, c), while the latter is modified when it occurs as context c. While previous implementations initialize only the word vectors from the first model and the context vectors randomly, we initialize the context vectors from the first model as well, as done by Ahmad et al. (2020). In this way, we expect to introduce considerably less noise to the vectors in the second corpus.
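The procedure can be sketched with a minimal numpy SGNS (a toy stand-in for a full implementation; the corpora, vocabulary and hyper-parameter values are invented for illustration, and negatives are drawn uniformly rather than from the unigram distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_train(pairs, W, C, k=5, alpha=0.025, epochs=5, seed=0):
    """In-place SGNS updates on word matrix W and context matrix C."""
    rng = np.random.default_rng(seed)
    V = W.shape[0]
    for _ in range(epochs):
        for w, c in pairs:
            negs = rng.integers(0, V, size=k)
            for ci, label in [(c, 1.0)] + [(n, 0.0) for n in negs]:
                g = alpha * (label - sigmoid(W[w] @ C[ci]))
                dW = g * C[ci]
                C[ci] += g * W[w]  # update context vector
                W[w] += dW         # update word vector

d, V = 5, 6
rng = np.random.default_rng(0)
pairs1 = [(0, 1), (0, 2), (3, 4)] * 50  # corpus 1 (toy word-context pairs)
pairs2 = [(0, 4), (0, 5), (3, 4)] * 50  # corpus 2: word 0 shifts contexts

# Step 1: train on the first corpus from random initialisation.
W1 = (rng.random((V, d)) - 0.5) / d
C1 = np.zeros((V, d))
sgns_train(pairs1, W1, C1)

# Step 2 (VI): initialise BOTH the word and the context matrix from
# step 1 -- not only the word matrix -- then train on the second corpus.
W2, C2 = W1.copy(), C1.copy()
sgns_train(pairs2, W2, C2)

def cos_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Change score per word: cosine distance between its vector
# before and after the second training step.
print(cos_dist(W1[0], W2[0]), cos_dist(W1[3], W2[3]))
```

The key design choice described above is the `W2, C2 = W1.copy(), C1.copy()` line: copying both matrices, rather than only the word vectors, carries over the full state of the first model.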
Orthogonal Procrustes. SGNS is trained on each corpus separately, resulting in matrices A and B. To align them we follow Hamilton et al. (2016b) and calculate an orthogonally-constrained matrix

W* = arg min_{W : W^T W = I} ||BW − A||_F,

where the i-th rows of A and B correspond to the same word. Using W* we get the aligned matrices A^OP = A and B^OP = BW*. Prior to this alignment step we length-normalize and mean-center both matrices (Artetxe et al., 2017).

Word Injection. The sentences of both corpora are shuffled into one joint corpus, but all occurrences of target words are substituted by the target word concatenated with a tag indicating the corpus it originated from (Ferrari et al., 2017). This leads to two vectors for each target word in one vector space, while non-target words receive only one vector encoding information from both corpora. This is very similar to Temporal Referencing (TR) (Dubossarsky et al., 2019), the difference being that with TR, the target-context pairs used for training never contain tagged target words as contexts but rather the genuine (untagged) words.
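The OP step can be sketched in numpy; this is a minimal illustration of the recipe above (length-normalize, mean-center, then solve the orthogonal Procrustes problem via SVD), not the authors' exact code:

```python
import numpy as np

def procrustes_align(A, B):
    """Align B onto A; the i-th rows of A and B index the same word."""
    def preprocess(X):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
        return X - X.mean(axis=0)                          # mean-center columns
    A, B = preprocess(A), preprocess(B)
    # W* = argmin_{W orthogonal} ||BW - A||_F has the closed-form
    # solution W* = U V^T, where B^T A = U S V^T (SVD).
    U, _, Vt = np.linalg.svd(B.T @ A)
    W = U @ Vt
    return A, B @ W

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
B = A @ Q  # B is A under an arbitrary orthogonal transformation

A_op, B_op = procrustes_align(A, B)
print(np.allclose(A_op, B_op, atol=1e-8))  # True: the rotation is recovered
```

The usage example mimics the motivating scenario: two SGNS runs that differ only by an arbitrary orthogonal transformation become directly comparable after alignment.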

Measures
To quantify semantic change on the aligned vector representations, we use two vector similarity measures. For Subtask 1 we apply Local Neighborhood Distance (LND), as it showed superior performance to CD for binary change detection in Schlechtweg and Schulte im Walde (2020). LND is based on second-order cosine similarity and measures the extent to which the distances of x and y to a union of their k nearest neighbors differ (Hamilton et al., 2016a). Following Hamilton et al. (2016a) we chose k = 25. We split the LND scores into two equally sized groups; the group containing the high values was labelled 1 (change). For Subtask 2 we use simple cosine distance (Salton and McGill, 1983).
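Both measures can be sketched in numpy (a simplified illustration assuming the query word is itself a row of each aligned space; matrix and parameter names are our own):

```python
import numpy as np

def cos_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cd(x, y):
    """Cosine distance between a word's two aligned vectors (Subtask 2)."""
    return 1.0 - cos_sim(x, y)

def lnd(x, y, space_x, space_y, k=25):
    """Local Neighborhood Distance (Subtask 1): second-order cosine
    distance over the union of the k nearest neighbors of x and y in
    their respective (n_words, d) spaces."""
    def knn(v, space):
        sims = space @ v / (np.linalg.norm(space, axis=1) * np.linalg.norm(v))
        return set(np.argsort(-sims)[1:k + 1])  # skip the word itself
    union = sorted(knn(x, space_x) | knn(y, space_y))
    # Similarity profiles of x and y against the shared neighbor set.
    sx = np.array([cos_sim(x, space_x[i]) for i in union])
    sy = np.array([cos_sim(y, space_y[i]) for i in union])
    return 1.0 - cos_sim(sx, sy)

rng = np.random.default_rng(0)
space = rng.standard_normal((50, 10))
x = space[7]
# With identical spaces and vectors, both measures are ~0 (no change).
print(cd(x, x), lnd(x, x, space, space, k=5))
```

In the full system, the LND scores over all target words would then be split at the median to produce the binary labels described above.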

Experimental setup
SemEval-2020 Task 1 comprises a binary classification task (Subtask 1) and a ranking task (Subtask 2) on data from four languages: English, German, Latin and Swedish. Subtask 1 asks participants to decide which target words lost or gained senses between corpora from two time periods t_1 and t_2, and which ones did not. Subtask 2 asks participants to rank a set of target words according to their degree of LSC (change in sense frequency distribution) between t_1 and t_2. The tasks differ in that it is possible for a word to show a high degree of LSC in Subtask 2 while not gaining or losing a sense in Subtask 1 (or vice versa). For example, the German word abgebrüht is used to describe (1) the process of cooking food in water and (2) an emotionally insensitive person. Both senses are present across time periods t_1 and t_2, but sense 1 dominates period t_1 while sense 2 dominates t_2. Thus, the number of senses has not changed, but the word has undergone significant semantic change.
The four languages have lists of 31 to 48 target words, each annotated with values for Subtasks 1 and 2. Performance is measured by accuracy and Spearman's rank-order correlation coefficient, respectively. For each of the four languages two corpora (corpus_1, corpus_2) are provided by the organizers, containing sentences from different time periods. These corpora show strong differences in terms of size, time period and genre (see Appendix A), posing a very heterogeneous, challenging setting for evaluation and parameter tuning.

Table 1 lists the evaluation-phase scores of the top three contenders for both subtasks as well as our system. During this phase, submission scores and leaderboards were hidden. At the end of the evaluation phase the best submission (out of a maximum of ten) was put on the leaderboard for both subtasks. We only submitted results for VI with d/e of 5/30, 3/50 and 8/30. The very low choices for d are motivated by the results of Ahmad et al. (2020), where VI achieved high performance with d = 5. With this exceptionally low d we scored 13th in Subtask 1 and 8th in Subtask 2 out of 33 teams. Our methodology for Subtask 1 has much room for improvement, but we decided to direct our attention towards Subtask 2 during post-evaluation. The three best teams for Subtask 2 used models based on OP and TR/WI. Their average scores are very similar and ahead of ours. The models consistently perform best on German, then Swedish, followed by English and Latin. From this limited picture, VI alignment seems to be inferior to OP and WI. But after tuning each model to its optimal d, performances are barely distinguishable (see Table 1, Post-Evaluation). This is most prominent for Latin. We attribute the low score (0.10) during the evaluation phase to the Latin dataset being very challenging due to its size and heterogeneity, in combination with high discrepancies between the number of vector updates in the second training step (see below).
Performance significantly improved after switching training order (see Appendix C).

Analysis
We tested the performance of VI, OP and WI on the four languages with varying dimensionality. Training epochs were adjusted to compensate for differences in corpus size: German and Swedish were trained for 5 epochs, Latin and English for 30 epochs. We found that VI is very sensitive to training order (see Appendix C): this was most prominent for Swedish and Latin, as corpus_1 and corpus_2 differ considerably in size. To get comparable results, we switched the training order for these two languages: instead of first training on corpus_1 and then corpus_2, we train on corpus_2 and then corpus_1. As shown in Figure 1, all models achieve high correlations for German and Swedish (0.6-0.7). On English and Latin, correlation is lower (0.3-0.4), which is probably related to corpus size, and performances are barely distinguishable (see Figure 1, bottom). For German, VI and WI show a clear global peak in performance at d = 50, while for the other languages we find several local peaks. In German, the highest performance is obtained by VI, closely followed by WI, while in Swedish OP is best. In all languages the performance of OP is very consistent across d > 25, and it is the most robust model in high d. WI also shows robust performance in high d, but is mostly outperformed by OP. In all languages, we see a steep drop-off in the performance of VI with higher d.
What is each model's optimal dimensionality? For VI and WI the optimal d in our models is much lower than the common choice of 200-300 (Hamilton et al., 2016b; Dubossarsky et al., 2019; Shoemark et al., 2019). In Swedish, which has the largest training corpus, optimal ds are higher, suggesting a possible correspondence between corpus size and optimal d.
Across all corpora, OP tends to have higher optimal values than VI and WI. This may be because orthogonal alignment works better in high dimensions, as there are more degrees of freedom to rotate the vectors. In Swedish, all methods show two local maxima instead of a single global maximum. This behavior becomes more pronounced with increasing numbers of training epochs (see Figure 2). It could be explained by the two Swedish corpora having different optimal ds (due to their differences in size and homogeneity).
Why does VI's performance drop in high d? We tested several hypotheses to explain the drop. We first speculated that the embeddings for corpus_2 generally drift away from their initial state, as observed by Kim et al. (2014), where even the word with the least amount of measured semantic change had a cosine distance of almost 0.1. Such a drift could be repaired by post-hoc alignment; however, additional OP alignment did not eliminate the drop. We then tested the hypothesis that the number of training updates influences cosine distances. Hence, we calculated the correlation between the predicted ranking of target words (according to cosine distance) and their frequency (reflecting the number of training updates) in the second corpus, and compared this correlation across d (see Figure 2, top). There is a clear frequency bias for VI that becomes stronger with increasing d. Thus, the predicted cosine distances do not reflect LSC but rather frequency, leading to poor performance. The bias correlates negatively with performance, as can be seen by comparing the top to the bottom panels of Figure 2. Interestingly, the number of training epochs also has a strong effect on this bias: an increasing e reduces the bias for low d and consequently drastically improves performance (see Figure 2, cf. top and bottom). We were not able to find an explanation for the cause of this behavior.
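The bias check described above can be reproduced in miniature with scipy (the distances and frequencies below are invented for illustration; in the paper the correlation is computed over the SemEval target words):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-target cosine distances (the model's change ranking)
# and the targets' frequencies in the second corpus.
cos_dists = np.array([0.42, 0.35, 0.61, 0.28, 0.55])
freqs_c2 = np.array([120, 340, 45, 800, 60])

# A strong (negative) correlation means the ranking reflects frequency
# rather than semantic change: rarer words get larger distances.
rho, p = spearmanr(cos_dists, freqs_c2)
print(rho)  # -1.0 for this toy data: a perfectly frequency-dominated ranking
```

Tracking this rho across d is what produces the curves in Figure 2 (top); an unbiased model would keep it near zero for all dimensionalities.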
The extent of the frequency bias is determined by several parameters, the main one being word frequency in the second corpus. We experimented with modified corpora in which the frequency of all target words was fixed to 200, and less frequent target words were ignored. This completely removed the frequency bias. However, if frequency differences among target words exist, as is the case with most datasets, the noise induced by those differences may be exaggerated by dimensionality or reduced by the number of training epochs. Understanding and explaining the frequency bias is outside the scope of this work, but will be part of future work.

Conclusion
Our shared-task system investigated Vector Initialization (VI) alignment in a commonly used LSC detection model based on Skip-Gram with Negative Sampling, focusing on the role of vector-space dimensionality. Our results suggest that LSC detection models integrating vector-space alignment should pay more attention to model-specific characteristics, and to the dimensionality parameter in particular. Current state-of-the-art models are dominated by OP and WI alignment, as a wide variety of reasonable parameters yield good results, whereas VI is very susceptible to parameters such as training order, dimensionality and number of epochs. However, we demonstrate that VI is able to outperform OP and WI alignment if tuned properly. Due to time limitations we could not fully explore the effects of epochs on VI, which have proven to play a significant role for dimensionality-dependent performance. Future work will include a closer look at the connection between dimensionality, frequency noise and training epochs.

C Influence of training order
We switched the training order for Swedish and Latin due to their size differences across corpora. The switch led to noticeable increases in performance for both languages, see Figure 3.