EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes

This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1 on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding-based profiles of a word w (calculated with respect to a set of reference words) in the source and the target domains (which can simply be two time frames t_1 and t_2). The underlying assumption is that a lexical-semantic change of word w affects its co-occurring words and subsequently alters its neighborhoods in the embedding spaces. We show that by using a resampling framework for the selection of reference words (with conserved senses), we can more reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in SemEval-2020 Task 1.


Introduction
SemEval-2020 Task 1 addresses the unsupervised detection of word sense changes over time in German, English, Swedish, and Latin. In particular, the challenge focuses on detecting and quantifying the sense changes of a word w in the transition from time period t_1 to time period t_2 in these four languages, where the input for each language consists of text corpora dating from t_1 and t_2. The challenge involves two subtasks:
i. Classification: The goal of the classification task is the binary detection of lexical semantic change, i.e., whether the given word w lost or gained senses from t_1 to t_2.
ii. Ranking: This subtask involves ranking a given list of words (w_1, w_2, ..., w_M) by assigning scores that quantify the relative change of their word senses.
To evaluate the two subtasks, the participating systems are compared against a ground truth corpus annotated by native speakers or scholars of the respective languages.
Human languages constantly change due to cultural, technological, and social drift. Lexical semantic changes can materialize as the introduction or borrowing of new words, or, for existing words, as the acquisition or loss of word senses (Koch, 2016; Traugott, 2017). Computational methods for the automated detection of semantic changes can be extremely helpful in the study of historical texts or corpora spanning a very long period of time, e.g., the semantic analysis of 1000 years of poetry (Asgari and Chappelier, 2013), in the design of OCR algorithms for text digitization, or in the design of information retrieval systems that account for semantic changes (Tahmasebi et al., 2018). Applications in the study of historical texts aside, the proposed method can also detect lexical-semantic drift between different domains within the same time period. This can be useful for compiling glossaries and specific training material in industries where words acquire new senses compared to their standard usage, e.g., to facilitate more efficient training of new employees.
In the past decade, a variety of methods have been introduced in the literature for the automatic detection of lexical-semantic changes (Tahmasebi et al., 2018). We can refer to only a subset of this work here, including (i) co-occurrence-based methods (Sagi et al., 2009; Basile et al., 2016), (ii) embedding-based approaches (Bamman and Crane, 2011; Kim et al., 2014; Kulkarni et al., 2015; Asgari and Mofrad, 2016; Hamilton et al., 2016a; Asgari, 2019; Asgari et al., 2020), and (iii) topic-model-based approaches (Frermann and Lapata, 2016). In this paper, we extend DomDrift, our recently introduced embedding-based approach for the detection of semantic changes (Asgari et al., 2020), which was introduced for the extension of computational analyses to 1000+ languages (Asgari and Schütze, 2017). Similar to our earlier work on WELD (Asgari and Mofrad, 2016) and to Hamilton et al. (2016a), DomDrift is based on a comparison of the relative distances of words in the embedding spaces of the source and the target domains. To increase the stability of detection, we extend DomDrift to EmbLexChange with the following modifications: (i) instead of creating word profiles against all words shared by the source and target domains, we use only a subset of pivot words, which are frequent words with unchanged relative frequencies; (ii) we create multiple word profiles by resampling from the set of pivot words. We show that EmbLexChange can reliably detect lexical-semantic changes in English, German, Swedish, and Latin, achieving an average accuracy of 0.686 as the second-best system of the competition, where the first-place system achieved an accuracy of 0.687.

System overview
Here we detail the steps of the EmbLexChange system, an overview of which is depicted in Figure 1. The EmbLexChange framework is based on the following assumptions:
H1: Frequent words change at slower rates (Hamilton et al., 2016b; Dubossarsky et al., 2017).
H2: The relative frequency of unchanged words is not dramatically different across time periods/domains.
H3: Changes in word sense are reflected in the context; these are captured by the embedding model and result in changes of the neighbors in the embedding space.
Thus, the relative drift of a query word (a word whose lexical semantic change we want to investigate) with respect to unchanged words in the embedding space can characterize its lexical-semantic change.
Figure 1: Overview of the EmbLexChange system for unsupervised detection of lexical-semantic changes. The steps are detailed in §2.1. (Panel labels: 1.1 train embedding space for t_1; 1.2 train embedding space for t_2; relative location of the query in Ω_{t_1} and in Ω_{t_2}.)

EmbLexChange
1. Training language-model-based embedding spaces: Training word embeddings with a language modeling objective (e.g., skip-gram) has been shown to preserve syntactic and semantic regularities in the vector space (Mikolov et al., 2013; Pennington et al., 2014). Semantic changes impact the neighborhoods in the embedding space (H3). Thus, the first step in investigating change is to train embeddings separately on the text corpora of the time periods t_1 and t_2 (steps 1.1 and 1.2 in Figure 1). The only resource needed to generate the embedding space Ω_t is raw text. For embedding creation, we use fasttext (Bojanowski et al., 2017), which leverages subword information within the skip-gram architecture. Using subword information minimizes the out-of-vocabulary problem for query terms (Bojanowski et al., 2017). The result of this step is a pair of separate embedding spaces Ω_{t_1} and Ω_{t_2} for the time periods t_1 and t_2.
2. Selection of fixed words and preparation of pivot sets: To measure the degree of semantic change of the given query words between Ω_{t_1} and Ω_{t_2}, we need some fixed points. We call these the pivot set V_P: words whose semantics have not changed dramatically and whose relative positions in Ω_{t_1} and Ω_{t_2} remain almost constant (step 2 in Figure 1). For this purpose, based on H1 and H2, we propose to use frequent words whose relative frequency is higher than α in both time periods t_1 and t_2. We then filter this set by removing words whose relative frequency has changed between t_1 and t_2, resulting in V_P. These fixed points are then used to create query profiles in t_1 and t_2. To increase reliability and make variance analysis feasible, we draw N resamples, each containing M words, from V_P.
3. Query profile creation: In the next step, for each query word, we create t_1 and t_2 profiles based on the pivot resamples (step 3 in Figure 1). The profile in time t is an l1-normalized embedding similarity vector of the query to the terms in V_P^{(i)}: a softmax with temperature φ is applied to the similarities between the query vector and the vectors of the words in the resample. Here, V_P^{(i)} is the i-th resample from the pivot set (created in step 2), w_q is the query word, w_k is the k-th word in the i-th resample, \vec{w} is the vector representation of word w in the embedding space Ω_t, and φ is the temperature of the softmax function (used as a hyperparameter). Hence, for each V_P^{(i)} we can create one profile in t_1 and one profile in t_2.
4. Profile divergence calculation: Next, for each resample V_P^{(i)} we calculate the divergence λ_i between the profile in time period t_1 and the profile in time period t_2 using the KL-divergence D_KL. We average the λ_i's over the N resamples as the measure of semantic change for the query word w_q. Since D_KL has no upper bound, we estimate one from the λ's obtained on resamples of a large set of randomly selected words V_explore, which includes V_P and a set of words whose relative frequency has changed. We draw K resamples of M words from V_explore and calculate the λ_k's (the λ's of the words in the k-th resample of V_explore). We take the average of the 90th percentile over the K resamples as the upper bound and the average of the 10th percentile as the lower bound of λ, and use these bounds to scale any λ_i calculated for a query word to \tilde{λ}_i (0 ≤ \tilde{λ}_i ≤ 1). Given a threshold h, which can be adjusted as a hyperparameter on a validation set, we assign words whose scaled divergence exceeds h to the category of lexical semantic change.
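Steps 3 and 4 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the use of cosine similarity, dividing the similarities by the temperature φ, the direction of the KL-divergence (t_1 profile against t_2 profile), the linear-interpolation percentile, and all function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def profile(query_vec, pivot_vecs, phi=1.0):
    """Softmax with temperature phi over query-to-pivot similarities.

    Assumption: similarities are divided by phi. The result sums to 1.
    """
    scores = [cosine(query_vec, p) / phi for p in pivot_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q); q is clamped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_divergence(query_t1, query_t2, resamples_t1, resamples_t2, phi=1.0):
    """Average lambda_i over resamples.

    resamples_t1[i] and resamples_t2[i] hold the vectors of the same pivot
    words (same order) in the t_1 and t_2 embedding spaces.
    """
    lambdas = [
        kl_divergence(profile(query_t1, r1, phi), profile(query_t2, r2, phi))
        for r1, r2 in zip(resamples_t1, resamples_t2)
    ]
    return sum(lambdas) / len(lambdas)

def percentile(values, q):
    """q-th percentile with linear interpolation (one of several conventions)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def scaling_bounds(explore_lambdas):
    """Average 10th and 90th percentiles over the K exploration resamples."""
    k = len(explore_lambdas)
    lower = sum(percentile(r, 10) for r in explore_lambdas) / k
    upper = sum(percentile(r, 90) for r in explore_lambdas) / k
    return lower, upper

def scale(lam, lower, upper):
    """Map a divergence to [0, 1], clipping values outside the bounds."""
    return min(1.0, max(0.0, (lam - lower) / (upper - lower)))
```

Clipping in `scale` is one way to satisfy the stated range 0 ≤ \tilde{λ}_i ≤ 1 when a query word falls outside the estimated bounds; the text does not say how such cases are handled.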

Data
The dataset used in this shared task includes corpora of English, German, Latin, and Swedish texts. For each language, the text corpora of two time periods are given. More details on the exact time frames and data sizes are provided in Table 1.

Experiment
The goal of SemEval-2020 Task 1 is to detect words whose semantics change in the transition from time period t_1 to time period t_2 in English, German, Swedish, and Latin. We closely follow the steps described in §2.
1. Language-model-based embedding setup: We train fasttext (Bojanowski et al., 2017) embeddings using the skip-gram architecture for each pair of language and time period separately. For fasttext training, we set the window size to c = 7 and the embedding size to d = 100. If a validation set is available, both c and d can be optimized as hyperparameters for each setting. It is known that a larger c favors semantic representations of words and a smaller c favors syntax-related representations (Lison and Kutuzov, 2017).
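As a rough sketch of this setup, the settings c = 7 and d = 100 map onto the `ws` and `dim` arguments of `train_unsupervised` in the official fastText Python bindings. Whether the authors used these bindings or the command-line tool is not stated, so the choice of interface, the helper names, and the corpus paths are assumptions.

```python
def skipgram_config(window=7, dim=100):
    """Keyword arguments for fasttext.train_unsupervised (c -> ws, d -> dim)."""
    return {"model": "skipgram", "ws": window, "dim": dim}

def train_period_embeddings(corpus_paths, window=7, dim=100):
    """Train one embedding space per time period.

    corpus_paths: dict such as {"t1": "corpus_t1.txt", "t2": "corpus_t2.txt"}
    (hypothetical file names; one plain-text corpus per period).
    """
    import fasttext  # assumption: the official fastText bindings are installed

    config = skipgram_config(window, dim)
    return {
        period: fasttext.train_unsupervised(input=path, **config)
        for period, path in corpus_paths.items()
    }
```

All other training options are left at the library defaults here, since the text reports only c and d.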

2. Pivot resamples creation: We first prepare a set of frequent words that exist in both t_1 and t_2 for each language, using the relative frequency threshold α to select the most frequent words. Next, we filter this set to keep only words whose normalized frequencies do not change substantially between t_1 and t_2, i.e., 2/3 < freq_{t_1}(w) / freq_{t_2}(w) < 3/2, resulting in our V_P set. Subsequently, we draw N = 10 resamples of size M = 5000 from V_P for each language.
3. Query profile creation: In the next step, as presented in §2 step 3, for each query word we create t_1 and t_2 profiles for each of the N = 10 pivot resamples from the previous step.
4. Profile divergence calculation: Next, for each of the N resamples V_P^{(i)}, we calculate λ_i and scale it to \tilde{λ}_i using K = 5 resamples of M = 5000 words from V_explore. Subsequently, for the binary detection of changes, we apply different thresholds h to the average of the scaled divergences \tilde{λ}_i over the V_P^{(i)}'s and assign words whose average exceeds h to the category of lexical semantic change.
Evaluation: For binary detection (Subtask 1), accuracy is used to compare the predicted lexical semantic changes with the given ground truth. For the ranking setting (Subtask 2), we report both the Spearman rank-order correlation coefficient (proposed by the task organizers) and Kendall-τ (which gives more accurate p-values for smaller sample sizes (Bonett and Wright, 2000)) to measure the correspondence between the calculated divergences (λ's) and the provided ground-truth scores. In addition, we repeat the experiments without resampling (N = K = 1) to investigate the effect of resampling.
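The pivot selection and resampling in step 2 can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, each resample is drawn without replacement, and the resample size is capped at the size of the pivot set.

```python
import random
from collections import Counter

def relative_frequencies(tokens):
    """Relative frequency of each word in a tokenized corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def select_pivots(freq_t1, freq_t2, alpha, low=2 / 3, high=3 / 2):
    """Words above alpha in both periods whose frequency ratio stays in (low, high)."""
    pivots = []
    for word in freq_t1.keys() & freq_t2.keys():
        f1, f2 = freq_t1[word], freq_t2[word]
        if f1 > alpha and f2 > alpha and low < f1 / f2 < high:
            pivots.append(word)
    return sorted(pivots)  # sorted so that seeded resampling is reproducible

def draw_resamples(pivots, n_resamples=10, size=5000, seed=0):
    """Draw N resamples of M pivot words (assumption: without replacement)."""
    rng = random.Random(seed)
    size = min(size, len(pivots))  # assumption: cap M at the pivot-set size
    return [rng.sample(pivots, size) for _ in range(n_resamples)]
```

For example, with α = 1e-4 a word with relative frequencies 0.01 in t_1 and 0.002 in t_2 has a ratio of 5 and is excluded, while one with 0.001 and 0.0012 has a ratio of about 0.83 and is kept.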

Results
The results of EmbLexChange in the detection and quantification of lexical semantic changes in English, German, Swedish, and Latin are provided in Table 2. After the competition, we had the chance to further optimize the hyperparameters, leading to the current results, which are slightly improved over those submitted to the competition leaderboard. The scores reported by the organizers during the evaluation phase are also provided in bold in parentheses. Our results show that resampling improves both accuracy and Spearman's rank correlation coefficient in all four languages. The EmbLexChange scores on the test set for all languages are available at http://language-lab.info/emblexchange/. Upon obtaining the required approvals, the code will be available at https://github.com/ehsanasgari/EmbLexChange.
Binary detection: EmbLexChange detected the semantic changes in English, German, Swedish, and Latin with accuracies of 70.3%, 75.3%, 77.4%, and 60%, respectively. The selected h values, i.e., the thresholds used to assign the positive or the negative class for each language, are also provided in Table 2.
Ranking: The Kendall-τ p-values for English, German, and Swedish show that there is a significant correspondence between the EmbLexChange scores and the ground-truth scores in those languages. The Spearman rank-order correlation coefficient is also calculated for all languages, with an average of 0.357 over the four languages. Latin proved more challenging in both the binary and the graded prediction of lexical semantic change.
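The thresholding behind the binary results can be sketched in a few lines. This is a hypothetical illustration: the function names, the toy scores, and the simple sweep over candidate thresholds are assumptions, and in practice the threshold should be tuned on held-out data.

```python
def classify(scaled_scores, h):
    """Label a word 1 (changed) if its scaled divergence exceeds h, else 0."""
    return {word: int(score > h) for word, score in scaled_scores.items()}

def accuracy(predicted, gold):
    """Fraction of gold-labelled words whose predicted label matches."""
    hits = sum(predicted[word] == label for word, label in gold.items())
    return hits / len(gold)

def best_threshold(scaled_scores, gold, candidates):
    """Pick the candidate h with the highest accuracy on the given labels."""
    return max(candidates, key=lambda h: accuracy(classify(scaled_scores, h), gold))

# Toy example with made-up scores and labels.
scores = {"a": 0.9, "b": 0.2, "c": 0.6}
gold = {"a": 1, "b": 0, "c": 0}
print(accuracy(classify(scores, 0.5), gold))
print(best_threshold(scores, gold, [0.1, 0.5, 0.7]))
```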

Discussion and Conclusions
In this paper, we proposed EmbLexChange, a framework for the unsupervised detection of lexical semantic changes. We defined EmbLexChange as the divergence between the embedding-based profiles of a word w (calculated with respect to a set of pivot words) in the source and the target domains (e.g., two historical time frames). By selecting pivot words through a resampling framework, we increase the reliability of this divergence. The underlying assumption of our method is that changes in the lexical semantics of word w affect its co-occurring words and subsequently alter its neighborhoods in the embedding spaces. We showed that EmbLexChange can reliably detect lexical-semantic changes in English, German, Swedish, and Latin, achieving second place in the binary detection of semantic changes in SemEval-2020 Task 1. The detection of semantic changes in Latin was more challenging than in the other languages. One possible reason is the imbalance of embedding training instances between Latin t_1 and Latin t_2, as well as the overall smaller corpora for Latin in comparison to the other languages (shown in Table 1). Another possible reason is the split of time frames, where t_2 in Latin spans a long period of 2000 years.
The overall SemEval results show that EmbLexChange performs better in the binary detection of semantic changes than in the ranking setting. However, we should note that manually creating a ranking ground truth is a much more challenging task than creating a binary classification ground truth. Thus, we believe that the classification results might be more reliable than the ranking results.
EmbLexChange requires only the raw texts of the time frames/domains of interest. Semantic changes can then be detected based on the divergence between the embedding-based profiles of words in the source and target domains. One advantage of using an embedding-based profile is that, by increasing the window size in the embedding training, we can move from syntactic changes toward semantic changes, which can be investigated in more depth as a future direction of research.