SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change

This paper describes SChME (Semantic Change Detection with Model Ensemble), a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change. SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature. More specifically, we combine the cosine distance of word vectors, a neighborhood-based metric we name Mapped Neighborhood Distance (MAP), and a word frequency differential metric as input signals to our model. Additionally, we explore alignment-based methods to investigate the importance of the landmarks used in this process. Our results show evidence that the number of landmarks used for alignment has a direct impact on the predictive performance of the model. Moreover, we show that languages that suffer less semantic change tend to benefit from using a large number of landmarks, whereas languages with more semantic change benefit from a more careful choice of the number of landmarks used for alignment.


Introduction
The problem of detecting Lexical Semantic Change (LSC) consists of measuring and identifying change in word sense across time, such as in the study of language evolution, or across domains, such as determining discrepancies in word usage over specific communities (Schlechtweg et al., 2019). One of the greatest challenges of this problem is the difficulty of assessing and evaluating models and results, as well as the limited amount of annotated data (Schlechtweg and Walde, 2020). For that reason, the vast majority of related work in the literature pursues this problem from an unsupervised perspective, that is, detecting semantic change without prior knowledge of the "truth". The importance of such a task is manifold: to humans, it can be a powerful tool for studying language change and its cultural implications; to machines, it can be used to improve language models in downstream tasks such as unsupervised word translation and fine-tuning of word embeddings (Joulin et al., 2018; Bojanowski et al., 2019). In this task, the goal is to develop a method for unsupervised detection of lexical semantic change over time by comparing two corpora from different time periods in four languages: English, German, Latin, and Swedish. In particular, we are required to solve two subtasks: binary classification of semantic change (Subtask 1) and semantic change ranking (Subtask 2).
There are many ways in which a word may change. Specifically, a word w may change sense because it has been completely replaced by a synonym w_s (lexical replacement), or because it gains a new meaning, in which case word w may keep or lose its previous meaning across time and domain (Kutuzov et al., 2018). Each type of change has its unique characteristics and may require different approaches in order to be detected. In this paper we describe a novel model ensemble method based on different features (signals) that we can extract from the text using distributional models (skip-gram word embeddings) and word frequency. Our model is primarily based on features extracted from independently trained Word2Vec embeddings aligned with orthogonal Procrustes (Schönemann, 1966), such as cosine distance, but also introduces two novel measures based on second-order distances and word frequency. Based on the distribution of each feature, we predict the probability that a word has suffered change through an anomaly detection approach. The final decision is made by soft voting (averaging) over all the probabilities. For binary classification (Subtask 1), a threshold is applied to the final vote; for ranking (Subtask 2), the output of the soft voting is used as the ranking prediction.
Our results show that second-order methods and different feature combinations outperform the frequently used cosine distance in some subtasks and languages. Furthermore, we show that the methods are sensitive to the degree of change in each language. It is possible to improve the performance of these methods by aligning two embeddings of the same language from different time slices on a subset of words instead of all words. This opens a new avenue of research on finding optimal words for alignment. The code for the model can be obtained at https://github.com/mgruppi/schme.

Related Work
Most methods for detecting semantic change are based on the distributional property of word semantics. The general idea is to compute contextual information for word w in each time or domain, and apply a measure of difference or distance between the observed contexts of w. Some of the first methods for detecting semantic change compute context information using a co-occurrence matrix within a pre-defined window of size L (Sagi et al., 2009; Cook and Stevenson, 2010). This means that, for a vocabulary of size n, one computes an n × n matrix M where M_i,j is the frequency with which words i and j co-occur within a window of L words. This often yields a highly sparse matrix M, which is typically reduced in dimensionality by techniques such as Singular Value Decomposition (SVD). Once the matrices are computed, the contextual difference is measured by the cosine distance between a word's vectors.
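As a rough sketch of this classic pipeline (not our method), the counting, SVD reduction, and cosine comparison can be written as follows; all function names here are illustrative:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """Count how often each pair of vocabulary words co-occurs within `window` tokens."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in idx:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in idx:
                M[idx[w], idx[tokens[j]]] += 1
    return M

def reduce_svd(M, k=2):
    """Dense low-rank context vectors via truncated SVD."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

In the diachronic setting, one such matrix is built per corpus and the distance is taken between the two reduced vectors of the same word.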
Distributed word vector representations such as the ones obtained by skip-gram with negative sampling (SGNS) (Mikolov et al., 2013) are forms of learning distributional information without the need for computing sparse co-occurrence matrices. A work by Hamilton et al. (2016b) presents a method for detecting semantic change using SGNS word embeddings learned from each corpus and aligned with orthogonal Procrustes. The semantic change is, again, measured by the cosine distance between a word's vectors in each time/domain. In another study (Hamilton et al., 2016a), the authors introduce a measure of semantic change, named Local Neighborhood Change, based on how the neighborhood of a word changes in terms of the number of words it has in common across the two spaces.
To eliminate the need for alignment, several authors have proposed dynamic word embedding techniques, which jointly learn distributional word representations using the assumption that words are connected across time (Bamler and Mandt, 2017; Rudolph and Blei, 2018; Yao et al., 2018). The main assumption in such methods is that word changes are considerably small between adjacent time stamps t_1 and t_2, i.e., words evolve smoothly, and thus word representations should be close between these periods. We argue that the assumption that all words in t_1 and t_2 are smoothly connected through time does not always hold. Because the corpora are aggregated over several years, decades, or centuries, the semantic change may be drastic, more similar to a cross-domain scenario than a diachronic one. We illustrate this with the corpora in this task and with the use of a subset of landmarks for alignment, which has not been investigated in the literature.

Model Overview and Data
The data provided in this task consists of two corpora for each language, each corpus corresponding to a different time period t_1 or t_2, as well as a list of target words for which we must predict a binary change label and a ranking with respect to the magnitude of semantic change between t_1 and t_2. The corpora used for each language are summarized in Table 1.

Word Representations
Most of our features are based on the alignment of word embeddings. Thus, the first step of our system is to train a Word2Vec model on the corpora C_1 and C_2 for each language; let W_1 and W_2 denote the resulting word embeddings, respectively. Since W_1 and W_2 are learned independently, we cannot directly compare their vectors. Hence, similarly to Hamilton et al. (2016b), we apply orthogonal Procrustes (OP)

Language | Corpus                        | t_1       | t_2
English  | CCOHA (Alatrash et al., 2020) | 1810-1860 | 1960-2010
German   | DTA + BZ + ND                 | 1800-1900 | 1946-1990
Latin    | LatinISE                      | -         | -
Swedish  | KubHist (Borin et al., 2012)  | 1790-1830 | 1895-1903

Table 1: Data provided for the task. In addition to the corpora, a set of target words is given, for which we need to generate outputs in Subtasks 1 and 2.

(Schönemann, 1966) to align the word embeddings of the corpora. Given matrices A and B, the objective of OP is to learn an orthogonal transformation matrix Q that minimizes the sum of squared distances ||AQ − B||^2. Because Q is orthogonal, the transformation AQ is only subject to rotation and reflection, which preserves the relationships between the word vectors in A. We learn the transformation matrix Q from the alignment of W_1 and W_2, updating W_1 ← W_1 Q. Now the word vectors in W_1 can be directly compared to those in W_2. In the following sections, we discuss the distance metrics used by the model to measure semantic change.
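The OP problem has a closed-form solution via SVD: if A^T B = UΣV^T, then Q = UV^T. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Solve min_Q ||A Q - B||_F over orthogonal Q via SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Usage: Q = orthogonal_procrustes(W1, W2); W1_aligned = W1 @ Q
```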

Distance Measures
Cosine Distance (COS). One of the most used metrics for comparing word vectors is the cosine distance. The cosine distance between two vectors in a single source indicates how closely distributed the words are. In the semantic change scenario, we compute the cosine distance for word w as d_cos = 1 − cos(v_1, v_2), where v_1 and v_2 are the word vectors of w in W_1 and W_2, respectively. Ideally, a small value of d_cos implies that the contexts of w are similar in both corpora C_1 and C_2.
Mapped Neighborhood Change (MAP). This measure looks at how a word moves away from its neighborhood across the two corpora. To that end, we compute a second-order cosine distance vector s_1(v_1, N_1) between v_1 and its k nearest neighbors in W_1, which we denote as the set N_1. Then we compute another second-order vector s_2(v_1, N_1), again using v_1 but looking up the corresponding vector of each word in N_1 in the space of the second corpus, W_2. The mapped neighborhood change is then computed as the cosine distance d_map(v_1) = d_cos(s_1(v_1, N_1), s_2(v_1, N_1)). Although this method uses second-order distances like the Local Neighborhood Change (LNC) measure (Hamilton et al., 2016a), it differs by computing distances between the aligned input embeddings, while LNC only computes such distances within a single embedding matrix.
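To make the construction concrete, here is a minimal sketch of MAP over toy dictionary-based embeddings, assuming W_1 and W_2 are already aligned; the names and the brute-force neighbor search are illustrative, not our actual implementation:

```python
import numpy as np

def cos_dist(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def mapped_neighborhood_distance(w, W1, W2, k=3):
    """d_map(v1) = cosine distance between second-order vectors s1 and s2.

    s1: distances from v1 to its k nearest neighbors N1 in W1;
    s2: distances from v1 to the *same* words' vectors in W2.
    """
    v1 = W1[w]
    # k nearest neighbors of v1 in W1 by cosine distance (excluding w itself)
    dists = {u: cos_dist(v1, W1[u]) for u in W1 if u != w}
    N1 = sorted(dists, key=dists.get)[:k]
    s1 = np.array([cos_dist(v1, W1[u]) for u in N1])
    s2 = np.array([cos_dist(v1, W2[u]) for u in N1])
    return cos_dist(s1, s2)
```

When W_2 equals W_1, the two second-order vectors coincide and the distance is zero, matching the intuition that an unchanged neighborhood yields no signal.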
Frequency Differential (FREQ). Let f_1 and f_2 be the relative frequencies of word w in C_1 and C_2. We define the frequency differential for w as f(w) = (f_2 − f_1) / (f_1 + f_2). Positive values indicate an increase and negative values a decrease in frequency across the corpora. We argue that a steep increase in frequency may indicate change more strongly than a decrease, since a decrease may simply reflect a word becoming less popular or being replaced by another word without losing its original sense. This assumption is only viable because we know that C_1 always precedes C_2 in time.
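A one-line sketch of the differential, using the sign convention that positive values indicate an increase from C_1 to C_2 (the function name is ours):

```python
def freq_differential(f1, f2):
    """(f2 - f1) / (f1 + f2): positive means the word grew more frequent over time.

    f1, f2 are the relative frequencies of a word in the earlier and later corpus.
    The value is bounded in (-1, 1), which keeps it comparable across words.
    """
    return (f2 - f1) / (f1 + f2)
```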

Model Ensemble
We compute the aforementioned features for all words in the intersection of the vocabularies of C_1 and C_2, and use the observed feature distributions to determine potentially changed words. Let X_i denote the random variable associated with the distribution of feature i. We work under the assumption that small values of X_i indicate little or no semantic change, while unusually high values of X_i indicate a high chance that the word suffered change according to metric i. Small and large values are defined with respect to all the computed values in the distribution. For instance, if the cosine distance computed for a word is large compared to the cosine distances of the other words, it is likely that the word has changed. Therefore, we define the probability of change for a word with feature value x_i as P_i(x_i) = Pr(X_i ≤ x_i). Thus, P_i is the cumulative distribution function (CDF) of X_i, describing how unlikely a value as high as x_i is under the distribution of X_i. We aggregate the probability output of each feature by soft voting, i.e., averaging the features' predictions: for a feature vector x = (x_1, ..., x_m), the final prediction is P(x) = (1/m) Σ_i P_i(x_i). For classification, a threshold is applied to P(x) in order to determine the class. For ranking, the score P(x) is used directly.
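A small sketch of this ensemble, estimating each P_i as an empirical CDF by ranking the observed feature values (a simplifying assumption; ties are broken arbitrarily, and all names are illustrative):

```python
import numpy as np

def empirical_cdf_scores(values):
    """P_i(x) = Pr(X_i <= x), estimated from the observed feature values themselves."""
    values = np.asarray(values, dtype=float)
    ranks = values.argsort().argsort()   # 0-based rank of each value
    return (ranks + 1) / len(values)     # empirical CDF in (0, 1]

def soft_vote(feature_matrix):
    """Average the per-feature CDF scores; rows = words, columns = features."""
    probs = np.column_stack([empirical_cdf_scores(col) for col in feature_matrix.T])
    return probs.mean(axis=1)

def classify(scores, threshold=0.75):
    """Binary decision for Subtask 1; the raw scores serve as the Subtask 2 ranking."""
    return (scores > threshold).astype(int)
```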

Evaluation
We conduct all experiments on the data provided for SemEval-2020 Task 1 for all four languages. Given that most of the corpora have been pre-processed with lemmatization and tokenization, our preprocessing consists of removing words that occur fewer than 10 times and tokenizing at whitespace. In this section we present the experiments and results for the model submitted to the task, as well as additional analysis of the model parameters.
We begin by learning the distributional representations of words in each corpus using Gensim's (Řehůřek and Sojka, 2010) implementation of Word2Vec. The parameters for Word2Vec are: vector size d = 300, window L = 10, negative samples ng = 5, and minimum word count min_wc = 10. Next, we align the learned word vectors via OP using the intersecting vocabulary as landmarks. Then, we compute the distance metrics and their distributions so that we can obtain the vote Pr(X_i ≤ x_i). Finally, we apply the model ensemble to different feature configurations to predict a final score. For classification, we apply a threshold t to the model output P(x), such that the predicted class is y = 1 if P(x) > t, and y = 0 otherwise. For ranking, the final score P(x) is used.
Since there was no validation data during the evaluation phase, our submissions included multiple feature and threshold settings. The feature configurations are combinations of the cosine distance (COS), mapped neighborhood distance (MAP), and frequency differential (FREQ). The applied threshold levels are {0.5, 0.75, 0.9}. Our team (RPI-Trust) ranked 4th place in Subtask 1 with a score of 0.660, and 6th place in Subtask 2 with a score of 0.427 in the evaluation phase.

Post-Evaluation
We evaluate our model on the provided test data in the post-evaluation phase. First, we fix a threshold of t = 0.75, then we use different feature combinations to evaluate the performance on each language. Classification results, shown in Table 2, indicate that there is no single best feature configuration for all languages. This may be because each language evolved differently between t_1 and t_2, with each feature model capturing different types of change. For example, many events between t_1 and t_2 for the English corpora may have contributed to the evolution of the language, such as the Second Industrial Revolution and the World Wars. Technological development introduced several new concepts, such as (air)plane and (record) player, which were unheard of in t_1; detecting such change relies on signals that can indicate a completely new use of a word that potentially keeps its previous senses. The results for the ranking task are shown in Table 3. Notice that the best feature configurations for classification are not necessarily the best for ranking. MAP performs best for Latin, which might be due to a potentially large semantic shift in this language that is better captured by incorporating neighborhood information. As seen in the decay column, COS and COS+MAP+FREQ (used in our submission) are the overall best performing methods across the two tasks.

Landmarks Are Important
When executing Procrustes alignment, one must choose which and how many words to align on. Since the alignment seeks to enforce short distances between landmark words, we hypothesize that it may mask some of the semantic shift involving those words. To test this, we analyze the effect of the number of landmark words on the model predictions by executing Procrustes alignment using the top n most frequent words as landmarks, with n ∈ [300, N], where N is the size of the intersecting vocabulary, keeping the classifier threshold fixed at t = 0.75. Figure 1 shows the results for all four languages.
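A sketch of landmark-restricted alignment: the rotation Q is learned from the landmark rows only, but applied to the whole embedding matrix (the function name and dictionary-based representation are ours):

```python
import numpy as np

def procrustes_on_landmarks(W1, W2, landmarks):
    """Learn Q = argmin ||A Q - B||_F from landmark rows only, then rotate all of W1.

    W1, W2: dicts mapping words to vectors; landmarks: words used to fit Q.
    """
    A = np.stack([W1[w] for w in landmarks])
    B = np.stack([W2[w] for w in landmarks])
    U, _, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt
    return {w: v @ Q for w, v in W1.items()}
```

Restricting the fit to the top-n most frequent words reproduces the experiment above: non-landmark words are rotated by Q but never pulled toward their counterparts, so their shift remains visible.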
These results present evidence for our argument: using more landmark words in the alignment procedure favors German and Swedish, which likely underwent less semantic shift than Latin and English. Notice that both of these corpora present class imbalance leaning towards unchanged words, and show increased accuracy as the number of landmarks increases. The same is not true for English, which has more balanced classes, nor for Latin, which is unbalanced towards changed words. In both of these languages, classification accuracy peaks at some n < N and then decreases, showing that using all possible words as landmarks may decrease accuracy.

Figure 1: (a) Accuracy in Subtask 1 using different numbers of landmark words for each language. Notice how German and Swedish do not show a decrease in accuracy despite the large number of landmarks used, whereas English and Latin reach optimal performance at some point before the maximum; (b) Ranking performance according to the number of landmarks shows a different trend from that of the binary classification, with Swedish decreasing in performance as the number of landmarks grows.

Conclusions
We presented a model for unsupervised detection of semantic change based on anomaly detection over a selection of features. SChME works directly on the input corpora, requiring no language-specific pre-trained models. The model ensemble is agnostic to the feature models, which means any measure of change can easily be incorporated into it, if desired. Our results show that the model parameters must be chosen carefully for each task and language. In particular, we have shown that the choice of landmarks for alignment is closely related to the degree of change of a language. In future work, we plan to address this issue by developing principled ways of choosing the words to align so that semantic change is revealed more accurately.