TemporalTeller at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection with Temporal Referencing

This paper describes our TemporalTeller system for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We develop a unified framework for the common semantic change detection pipeline: preprocessing, learning word embeddings, calculating vector distances and determining thresholds. We also propose the Gamma Quantile Threshold to distinguish between changed and stable words. Based on our system, we conduct a comprehensive comparison among BERT, Skip-gram, Temporal Referencing and alignment-based methods. Evaluation results show that Skip-gram with Temporal Referencing achieves the best performance, with 66.5% classification accuracy and 51.8% Spearman's rank correlation.


Introduction
Task 1 at SemEval-2020 is Unsupervised Lexical Semantic Change Detection. Semantic change is defined as a change in the lexical meaning, rather than the grammatical usage, of a language symbol (Bloomfield, 1933). It has become an increasingly common phenomenon under the influence of linguistic habits, social environments and many other factors. For example, the word gay previously meant happy but now mostly means homosexual. Such changes are indicators of the ideologies of their time as well as of trends in language development. Therefore, studying the evolution of word meanings can assist research in human language, historical information retrieval and arguably other disciplines.
Recently, there have been many studies on this topic. One direction is word-level change detection. These approaches represent each word as a single vector that reflects its overall meaning. Some works (Sagi et al., 2009; Gulordava and Baroni, 2011; Rodda et al., 2016; Kahmann et al., 2017) use co-occurrence matrices with SVD. Others (Kim et al., 2014; Rosenfeld and Erk, 2018; Hamilton et al., 2016a) adopt neural embeddings such as Skip-gram with negative sampling (SGNS) (Mikolov et al., 2013a). However, these word embeddings need to be aligned to the same vector space, which has been shown to introduce noise (Dubossarsky et al., 2019). Another direction is detecting semantic change at the sense level. These models investigate the different meanings of a word and can thus compare meanings in a fine-grained manner. In this paper, we focus on word-level change detection.
We identify two problems in these related works. First, no classification mechanism has yet been discussed: all of these papers quantify lexical semantic change, but none proposes a systematic approach to classifying words as changed or stable. Second, the models in use may be out of date. While the leaderboards of many NLP tasks such as reading comprehension are continuously refreshed by new models like BERT (Devlin et al., 2018), the methods adopted for this topic are relatively conventional. Though this is not necessarily a problem, we are curious whether the latest models could bring extra benefits.
To address the first problem, we propose the Gamma Quantile Threshold (GQT): we describe the cosine distances between word representations with a Gamma distribution and set the classification threshold as a quantile of the fitted distribution. For the second problem, we introduce contextualized embeddings using BERT (Devlin et al., 2018). For comparison, we also leverage temporal referencing (Dubossarsky et al., 2019), a promising method in this field. Finally, we design TemporalTeller, a framework that integrates learning word representations, alignment and threshold calculation for semantic change detection. It can learn word representations via SGNS and BERT, with or without temporal referencing. Experimental results show that GQT works well with the existing word embedding methods, and that SGNS with temporal referencing achieves the best performance in both the classification and ranking subtasks.
Our contributions can be summarized as follows:
• We propose the Gamma Quantile Threshold, the first algorithm to fill the gap between quantifying and classifying lexical semantic change.
• We introduce contextualized embeddings using BERT (Devlin et al., 2018) to the semantic change detection problem and conduct a comprehensive comparison between BERT and other methods.
• We design TemporalTeller, a framework that integrates the common semantic change detection pipeline. Under this framework, we perform a large-scale study of the effects of hyperparameters.
System Overview

Temporal Referencing (TR)

Temporal Referencing (TR) was created for term extraction (Ferrari et al., 2017) and later proven useful for lexical change detection (Dubossarsky et al., 2019). The idea is to tag each target word to indicate which corpus it comes from. For this task in particular, we add the suffixes new and old to target words from the modern and ancient corpora respectively. The word embeddings are then learned from the merged corpus that includes both the modern and the ancient one. After learning, the embeddings of target_new and target_old lie in the same vector space and can thus be compared directly. TR rests on the assumption that the meanings of the majority of context words are stable, so that the change of the target words is highlighted. TR brings several advantages. First, it eliminates alignment and thus reduces noise. Second, it lowers the data requirement, because all corpora contribute to the word representations collaboratively. Finally, it is compatible with any embedding-based method and extends naturally to scenarios with multiple corpora from multiple periods.
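As an illustration, the following is a minimal sketch of the TR preprocessing step. The target list and suffix format here are hypothetical placeholders, not our exact implementation:

```python
# Minimal sketch of Temporal Referencing preprocessing (illustrative;
# the actual target list and suffix format in our system may differ).
TARGETS = {"bag", "fiction"}  # hypothetical target words

def tag_sentence(tokens, period):
    """Append a period suffix (old or new) to every target word."""
    return [f"{tok}_{period}" if tok in TARGETS else tok for tok in tokens]

old_corpus = [["the", "bag", "was", "heavy"]]
new_corpus = [["a", "plastic", "bag"]]

# Merge the two corpora; only target words carry temporal tags, so the
# shared (untagged) context words anchor both periods in one vector space.
merged = [tag_sentence(s, "old") for s in old_corpus] + \
         [tag_sentence(s, "new") for s in new_corpus]
```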

Skip-Gram with Negative Sampling (SGNS)
SGNS maximizes the similarity between co-occurring words while minimizing that of non-co-occurring words by optimizing

$$\ell = \sum_{(w,c) \in D} \log \sigma(\vec{v}_w \cdot \vec{v}_c) + \sum_{(w,c) \in D'} \log \sigma(-\vec{v}_w \cdot \vec{v}_c),$$

where $\sigma$ is the sigmoid function, $\vec{v}_w$ and $\vec{v}_c$ are the word and context vectors, $D$ is the set of observed word-context pairs, and $D'$ is the set of randomly sampled negative word-context pairs (Levy and Goldberg, 2014).

BERT Contextualized Embedding
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a powerful language model trained on unlabeled text, and then fine-tuned for downstream tasks with a classifier added.
As suggested by its authors, we use the sum of the last 4 layers to encode both word meaning and context information. For each target word, we feed every occurrence, together with its context, into the model to obtain a contextualized embedding. The representation of a word is the average of the embeddings of all its occurrences. To reduce noise from the context, we propose to set a context window and remove the stop words and rare words inside it.
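The following is a minimal sketch of this procedure with the HuggingFace transformers library; the function names and the single-WordPiece assumption are ours, not part of the original system:

```python
# Sketch: sum the last four hidden layers for the target token in each
# occurrence, then average over occurrences. Assumes the target word
# appears in the sentence and maps to a single WordPiece token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def occurrence_embedding(sentence: str, target: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    target_id = tokenizer.convert_tokens_to_ids(target)
    pos = (enc["input_ids"][0] == target_id).nonzero()[0, 0]
    with torch.no_grad():
        hidden = model(**enc).hidden_states          # tuple of (1, seq, 768)
    return torch.stack(hidden[-4:]).sum(0)[0, pos]   # sum of last 4 layers

def word_embedding(sentences, target):
    # The word's representation is the mean over all its occurrences.
    return torch.stack([occurrence_embedding(s, target)
                        for s in sentences]).mean(0)
```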

Alignment with Orthogonal Procrustes (OP)
Orthogonal Procrustes (Hamilton et al., 2016b) is also based on the stable-context assumption. OP maps the word embeddings $V_a$ learned at year $a$ into $V_a W$ such that $\lVert V_a W - V_b \rVert_2$ is minimized:

$$W^{*} = \arg\min_{W^{\top} W = I} \lVert V_a W - V_b \rVert_2,$$

where $R$ is a binary matrix with $R_{i,j} = 1$ if and only if the $i$-th word in the first corpus is the $j$-th word in the second one. The solution is obtained by first performing the SVD $V_b^{\top} R V_a = U \Sigma V^{\top}$ and then setting $W^{*} = U V^{\top}$ (Schönemann, 1966).
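A minimal numpy sketch of the OP closed-form solution, under the simplifying assumption that the vocabularies are already paired row by row (i.e., $R$ is the identity after restricting to the shared vocabulary); the embedding matrices are random placeholders:

```python
# Orthogonal Procrustes alignment (Schoenemann, 1966): the orthogonal W
# minimizing ||V_a @ W - V_b|| is U @ Vt, where U, S, Vt = svd(V_a.T @ V_b).
import numpy as np

def procrustes_align(V_a: np.ndarray, V_b: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||V_a @ W - V_b||."""
    U, _, Vt = np.linalg.svd(V_a.T @ V_b)
    return U @ Vt

# Usage: align hypothetical ancient embeddings to the modern space, after
# which a word's two vectors can be compared by cosine distance.
rng = np.random.default_rng(0)
V_old, V_new = rng.normal(size=(1000, 100)), rng.normal(size=(1000, 100))
W = procrustes_align(V_old, V_new)
aligned_old = V_old @ W
```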

Gamma Quantile Threshold
We observe that the cosine distances (CD) between the old and new word vectors of the target words resemble a Gamma distribution: the majority of the data lies together, forming a peak, while the larger CDs form a long tail on the right. We assume that the words at the peak are stable, while those in the tail are changed. We therefore compute a global threshold by fitting the CDs to a Gamma distribution using maximum likelihood estimation. Based on our experimental results, we choose the 75% density quantile as the threshold, which cuts into the long right tail of the Gamma distribution.
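A minimal sketch of GQT with scipy; the distances below are synthetic placeholders, whereas our system fits the actual cosine distances of the target words:

```python
# Gamma Quantile Threshold sketch: fit a Gamma distribution to the cosine
# distances by maximum likelihood and take its 75% quantile as the global
# change/stable threshold.
import numpy as np
from scipy.stats import gamma

cds = np.random.default_rng(0).gamma(shape=2.0, scale=0.05, size=100)

# Fix loc=0 so the fit stays a two-parameter Gamma over [0, inf).
shape, loc, scale = gamma.fit(cds, floc=0)
threshold = gamma.ppf(0.75, shape, loc=loc, scale=scale)

changed = cds > threshold  # words whose CD exceeds the 75% quantile
```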
Unlike a quantile of the data, a quantile of the fitted distribution relies less on the selection of target words. In Figure 1a, for example, the number of large CDs lying at the tail is clearly smaller than what the fitted Gamma density function expects. As a result, the fraction of English target words classified as changed is the smallest among the four languages.

Figure 2 shows our TemporalTeller framework. The system integrates TR preprocessing, learning word embeddings, alignment, distance calculation and threshold determination.

Data and Evaluation

The dataset of this task covers English, German, Swedish and Latin (Schlechtweg et al., 2020). For each language, there are two corpora collected from ancient and modern documents respectively. The details are shown in Table 1. There are two subtasks. The first is distinguishing between changed and stable words; it is evaluated by accuracy (ACC) against human-annotated ground truth. The second is ranking the target words by their extent of change; it is evaluated by Spearman's rank-order correlation coefficient (SPR) against manually annotated rankings. We use the average performance over the four languages as the official measure for each subtask.

Hyperparameters and Implementation Details
Our implementation of SGNS with OP and TR is based on the repository of Dubossarsky et al. (2019). We set the context distribution smoothing factor α = 0.75 and the number of negative samples k = 5 as suggested by previous works (Mikolov et al., 2013b; Levy et al., 2015). The word embeddings are trained for 10 epochs, with words occurring less than twice omitted. We study the effects of different context window sizes l and embedding dimensions d.

From the results in Table 2, we can see that SGNS+TR achieves decent results but is relatively sensitive to hyperparameters. It yields the best classification accuracy when d = 100, l = 10, and the best SPR when d = 50, l = 5. This suggests that by looking at a wider range of neighbours with a slightly larger embedding dimension, SGNS can better distinguish stable words from changed ones. In comparison, a medium context window l = 5 leads to better ranking: l = 2 may miss some useful contexts, while l = 10 may include too many irrelevant contexts that harm ordering. This experiment suggests that fixing d at 300, as in some previous works (Hamilton et al., 2016b; Dubossarsky et al., 2019), may be suboptimal. It also contrasts with the finding that l = 2 yields the best results (Levy et al., 2015).
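For reference, the following gensim call is an illustrative equivalent of these settings; our system builds on the Dubossarsky et al. (2019) code rather than gensim, and the toy corpus is a placeholder:

```python
# Illustrative mapping of our SGNS hyperparameters onto gensim's Word2Vec.
from gensim.models import Word2Vec

merged = [["the", "bag_old", "was", "heavy"],
          ["a", "plastic", "bag_new"]] * 50  # toy TR-tagged corpus

model = Word2Vec(
    sentences=merged,
    sg=1,             # skip-gram
    negative=5,       # k = 5 negative samples
    ns_exponent=0.75, # context distribution smoothing alpha = 0.75
    vector_size=100,  # embedding dimension d
    window=10,        # context window size l
    min_count=2,      # omit words occurring less than twice
    epochs=10,
)

# TR places both periods' embeddings in one space, so they compare directly.
cd = 1 - model.wv.similarity("bag_old", "bag_new")
```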
Under the optimal hyperparameters found above, we compare the performance of SGNS+OP and SGNS+TR. The results in Table 3 show that TR is a clear winner. This also confirms that alignment introduces extra noise, and that eliminating alignment yields better performance.

We propose to use BERT with two add-ons in this task. The first is contextualization. Normally, the sum of the last 4 layers of the fine-tuned BERT is used as the word embedding. To further exploit context information, we propose to feed each occurrence of a target word, together with its neighbors, into the fine-tuned model and retrieve the embedding of that particular instance. The overall word embedding is then obtained by averaging all of its instance embeddings. The second is unimportant-word removal: we remove unimportant words from the context window for more concise and refined information. For all languages, we define unimportant words as those that occur fewer than 30 times in the merged corpus and those that are stop words according to the Natural Language Toolkit (Bird and Loper, 2004).

One thing to note is that temporal referencing remains meaningful when using BERT with contextualization. With the temporal tags, all the target words are treated as brand-new words whose representations must be learned from scratch. This blocks the influence of the semantic information in BERT's pre-training material (which is mostly modern text), and thus better highlights the difference between ancient and modern senses.
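A minimal sketch of the unimportant-word filter follows; it assumes the NLTK stop word lists have been downloaded, and the function name and frequency counter are illustrative:

```python
# Drop NLTK stop words and words occurring fewer than 30 times in the
# merged corpus from a context window before BERT inference.
from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP = set(stopwords.words("english"))  # per-language stop word lists

def clean_context(tokens, freq: Counter, target: str, min_count: int = 30):
    """Keep the target word plus frequent, non-stop-word context words."""
    return [t for t in tokens
            if t == target or (t not in STOP and freq[t] >= min_count)]
```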

Methods
After BERT is fine-tuned on the merged corpus, experiment (a) follows the normal setting without any add-ons. In experiment (b), we add contextualization. In (c), we further remove unimportant words from the contexts at the inference stage.

Table 4 shows a notable improvement from (a) to (b), which demonstrates the effectiveness of contextualization. The comparison between (b) and (c) reveals that removing unimportant words also helps. However, BERT does not perform as well as SGNS+TR. One possible reason is that SGNS is trained specifically to produce word vectors, which gives it an advantage in semantic representation. We also found that fine-tuning BERT on this dataset is anything but easy. On one hand, the target words with temporal referencing tags are new vocabulary items that need more iterations to learn. On the other hand, the corpora are not well suited to BERT, as they are noisy and relatively short. According to our experiments, overfitting the BERT model on this dataset can lead to catastrophically degraded performance. As a result, it is hard to balance learning the new words against reusing the linguistic information in the pretrained model.
In addition, we test the quality of the thresholds under different quantiles. It is observed again that 75% is the optimal choice.

Quantile      0.65    0.70    0.75    0.80
Average ACC   0.658   0.658   0.664   0.645

Table 5: Comparison of different quantiles in experiment (c)

Table 6 summarizes the scores and rankings of our system at the time of submission.

Conclusion
In this paper, we systematically studied unsupervised lexical semantic change detection from a computational perspective under our TemporalTeller framework. We proposed the Gamma Quantile Threshold, the first algorithm to fill the gap between quantifying semantic change and classifying change status. We exploited the latest advances in temporal referencing, and furthermore introduced BERT into the semantic change detection task. We conducted a comprehensive comparison among these methods in terms of hyperparameters and performance. We found that the combination of SGNS with temporal referencing yields the best performance. However, it is sensitive to hyperparameters, including the embedding dimension and the context window size; a suboptimal setting can lead to considerable performance degradation. BERT gives comparable results with base-model hyperparameters, but its hard-to-control training and diverged focus prevent it from taking first place. Even with contextualized embeddings and unimportant-word removal, it fails to prevail over SGNS with TR. This shows that the advantages of powerful general language models like BERT do not naturally extend to this task.

Figure 5 shows that the Gamma distribution also fits the cosine distances very well for BERT+TR. In Figure 6 (embeddings projected by PCA), the distance between bag_nn_old and bag_nn_new is small and their contextualized embeddings overlap. Looking at their contexts, we can see that both bag_nn_old and bag_nn_new refer to a container used for carrying things; some relatively distant occurrences of bag_nn_new are about plastic, which does not stray far from that sense. In contrast, the distance between fiction_nn_old and fiction_nn_new is much larger. The modern meaning of fiction_nn is close to science fiction and contemporary magazines, while its ancient meaning is more related to romance. Indeed, bag_nn is classified as stable and fiction_nn as changed.