Unsupervised Embedding-based Detection of Lexical Semantic Changes

This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1 on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding-based profiles of a word w (calculated with respect to a set of reference words) in the source and the target domains (which can simply be two time frames $t_1$ and $t_2$). The underlying assumption is that a lexical-semantic change of word w affects its co-occurring words and subsequently alters its neighborhoods in the embedding spaces. We show that, using a resampling framework for the selection of reference words, we can reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in SemEval-2020.


Introduction
SemEval-2020 Task 1 addresses the unsupervised detection of word sense changes over time in German, English, Swedish, and Latin. In particular, the challenge focuses on detecting and quantifying the sense changes of a word w in the transition from time period $t_1$ to time period $t_2$ in these four languages, where the input for each language is a pair of text corpora dating from $t_1$ and $t_2$. The challenge involves two sub-tasks:
i. Classification: The goal of the classification task is the binary detection of lexical semantic change (from $t_1$ to $t_2$) for a given word w.
ii. Ranking: This sub-task involves ranking a given list of words ($w_1, w_2, \ldots, w_M$) by lexical-semantic change, assigning scores that quantify the relative changes of the word senses.
To evaluate the two sub-tasks, the participating systems are compared against a ground truth annotated by native speakers or scholars of the respective languages.
Human languages constantly change due to cultural, technological, and social drift. Lexical semantic changes can materialize as the introduction or borrowing of new words or, for existing words, as the acquisition or loss of word senses. Computational methods for the automated detection of semantic changes can be extremely helpful in the study of historical texts or corpora spanning a very long period of time, e.g., in the design of OCR algorithms for text digitization, or in the design of information retrieval systems that account for semantic changes (Tahmasebi et al., 2018). Applications in the study of historical texts aside, the proposed method can also detect lexical-semantic drift between different domains within the same time period. This can be useful for compiling glossaries and specific training material in industries where words acquire new senses compared to their standard usage, e.g., to train new employees more efficiently.
In the past decade, a variety of methods have been introduced in the literature for the automatic detection of lexical-semantic changes (Tahmasebi et al., 2018). Here we can refer only to a subset of this work, including (i) co-occurrence-based methods (Sagi et al., 2009; Basile et al., 2016), (ii) embedding-based approaches (Kim et al., 2014; Kulkarni et al., 2015; Asgari and Mofrad, 2016; Asgari, 2019; Asgari et al., 2020), (iii) topic-model-based approaches (Frermann and Lapata, 2016), and (iv) alignment-based approaches (Bamman and Crane, 2011). In this paper, we extend DomDrift, our recently introduced embedding-based approach for the detection of semantic changes (Asgari et al., 2020), which was introduced to extend computational analyses to 1000+ languages (Asgari and Schütze, 2017). DomDrift compares the relative distances of words in the embedding spaces of the source and the target domains. To increase the stability of detection, we extend DomDrift to EmbLexChange with the following modifications: (i) instead of creating the word profile against all words common to the source and target domains, we use only a subset of pivot words, which are frequent words with unchanged relative frequencies; (ii) we create multiple word profiles by resampling from the set of pivot words. We show that EmbLexChange can reliably detect lexical-semantic changes in English, German, Swedish, and Latin, achieving an average accuracy of 0.686 as the second-best system of the competition, where the first-place system achieved an accuracy of 0.687.

System overview
Here we detail the steps of the EmbLexChange system, an overview of which is depicted in Figure 1. The EmbLexChange framework is developed based on the following assumptions. H1: frequent words change at slower rates (Hamilton et al., 2016). H2: the relative frequency of unchanged words is not dramatically different across time periods/domains. H3: a change in word sense changes the context of the word and consequently alters its neighbors in the embedding space. Thus, the relative drift of a query word (a word whose lexical-semantic change we aim to investigate) with respect to unchanged words in the embedding space can characterize the lexical-semantic change.
[Figure 1 panel labels: 1.1 Train embedding space for $t_1$; 1.2 Train embedding space for $t_2$; the relative location of the query in $\Omega_{t_1}$; the relative location of the query in $\Omega_{t_2}$.]
Figure 1: The overview of the EmbLexChange system for unsupervised detection of lexical-semantic changes. The steps are detailed in §2.1.

EmbLexChange
1. Training language-model-based embedding spaces: Training word embeddings with a language modeling objective (e.g., skip-gram) has been shown to preserve syntactic and semantic regularities in the vector space (Mikolov et al., 2013; Pennington et al., 2014). Semantic changes impact the neighborhoods in the embedding space (H3). Thus, the first step is to train embeddings separately for the text corpora of time periods $t_1$ and $t_2$ (steps 1.1 and 1.2 in Figure 1). The only resource necessary to generate the embedding space $\Omega_t$ is raw text. For embedding creation, we use fasttext (Bojanowski et al., 2017), which leverages subword information within the skip-gram architecture. Using subword information minimizes the number of out-of-vocabulary query terms (Bojanowski et al., 2017). The result of this step is two separate embedding spaces, $\Omega_{t_1}$ and $\Omega_{t_2}$, for the time periods $t_1$ and $t_2$.
2. Selection of fixed words and preparation of pivot sets: To measure the degree of semantic change of the given query words between $\Omega_{t_1}$ and $\Omega_{t_2}$, we need some fixed points, called the pivot set $V_P$, comprising words whose semantics have not changed dramatically and whose relative positions in $\Omega_{t_1}$ and $\Omega_{t_2}$ remain almost constant (step 2 in Figure 1). For this purpose, based on H1 and H2, we first select frequent words whose relative frequency is higher than α in both time periods $t_1$ and $t_2$. We then filter this set by removing words whose relative frequency has changed between $t_1$ and $t_2$, resulting in $V_P$. These fixed points are then used to create query profiles in $t_1$ and $t_2$. To increase reliability and make variance analysis possible, we draw N resamples $V_P^{(i)}$, each containing M words, from $V_P$.
3. Query profiles creation: In the next step, for each query word, we create $t_1$ and $t_2$ profiles based on the pivot resamples (step 3 in Figure 1).
The profile at time t is an $\ell_1$-normalized embedding similarity vector of the query to the terms in $V_P^{(i)}$, where $V_P^{(i)}$ is the $i$-th resample from the pivot set (created in step 2), $w_q$ is the query word, $w_k^{(i)}$ is the $k$-th word in the $i$-th resample, $\vec{w}$ is the vector representation of word $w$ in the embedding space $\Omega_t$, and $\varphi$ is the temperature of the softmax function (used as a hyperparameter). Hence, for each $V_P^{(i)}$ we can create one profile in $t_1$ and one profile in $t_2$.
4. Profile divergence calculation: Next, for each resample $V_P^{(i)}$ we calculate the divergence $\lambda_i$ between the profile in time period $t_1$ and the profile in time period $t_2$ using the KL-divergence. We average the $\lambda_i$'s over the N resamples as the measure of semantic change for the query word $w_q$. Since $D_{KL}$ does not have an upper bound, we estimate one from the $\lambda$'s computed on resamples of a large set of randomly selected words $V_{explore}$, which includes $V_P$ and a set of words with a change in their relative frequency. We draw K resamples of M words from $V_{explore}$ and calculate the $\lambda_k$'s (the $\lambda$'s of the words in the $k$-th resample of $V_{explore}$). We take the average of the 90th percentile over the K resamples as the upper bound and the average of the 10th percentile as the lower bound of $\lambda$, and use these bounds to scale any calculated $\lambda_i$ for a query word to $\hat{\lambda}_i$ ($0 \le \hat{\lambda}_i \le 1$). Given a threshold h, which can be tuned as a hyperparameter on a validation set, we assign query words with $\hat{\lambda} > h$ to the category of lexical semantic change.
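Steps 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the text only says that the profile is a softmax-normalized similarity vector with temperature φ, so the use of cosine similarity, the direction of the KL-divergence (profile at $t_1$ relative to profile at $t_2$), and all function names below are hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def profile(query_vec, pivot_vecs, phi=1.0):
    """Softmax with temperature phi over the query-to-pivot similarities.

    The result is non-negative and sums to 1 (l1-normalized).
    """
    sims = [cosine(query_vec, p) / phi for p in pivot_vecs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-300):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def semantic_change(query_t1, query_t2, resamples_t1, resamples_t2, phi=1.0):
    """Average divergence over N pivot resamples.

    resamples_t1[i] and resamples_t2[i] hold the vectors of the same pivot
    words, in the same order, taken from the t1 and t2 embedding spaces.
    """
    lambdas = []
    for pivots_t1, pivots_t2 in zip(resamples_t1, resamples_t2):
        p1 = profile(query_t1, pivots_t1, phi)
        p2 = profile(query_t2, pivots_t2, phi)
        lambdas.append(kl_divergence(p1, p2))
    return sum(lambdas) / len(lambdas)
```

Because each profile only encodes the position of the query relative to the pivots within its own space, the two embedding spaces never need to be aligned with each other.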
The dataset used in this shared task includes corpora of English, German, Latin, and Swedish texts. For each language, text corpora of two time periods are given. More details on the exact time frames and data sizes are provided in Table 1.

Experiment
The goal of SemEval-2020 Task 1 is to detect words whose semantics change in the transition from time period $t_1$ to time period $t_2$ in English, German, Swedish, and Latin. We closely follow the steps described in §2.
1. Language-model-based embedding setup: We train fasttext (Bojanowski et al., 2017) embeddings using the skip-gram architecture separately for each pair of language and time period. In training fasttext, we set the window size to c = 7 and the embedding size to d = 100. When a validation set is available, both c and d can be optimized as hyperparameters for each setting. A larger c is known to favor semantic representations of words, and a smaller c syntax-related representations.

2. Pivot resamples creation: We first prepare a set of frequent words occurring in both $t_1$ and $t_2$ for each language, choosing the relative frequency threshold α so as to select the top 10% to 20% most frequent words. Next, we filter this set to keep only the words whose normalized frequency does not change substantially between $t_1$ and $t_2$, i.e., $\frac{2}{3} < \frac{freq_{t_1}(w)}{freq_{t_2}(w)} < \frac{3}{2}$, resulting in our set $V_P$. Subsequently, we draw N = 10 resamples of size M = 5000 from $V_P$ for each language.
3. Query profiles creation: Next, as described in step 3 of §2, for each query word we create $t_1$ and $t_2$ profiles for each of the N = 10 pivot resamples drawn in the previous step.
4. Profile divergence calculation: Next, for each of the N resamples $V_P^{(i)}$, we calculate $\lambda_i$ and scale it to $\hat{\lambda}_i$ using K = 5 resamples of M = 5000 words from $V_{explore}$. Subsequently, for the binary detection of changes, we apply different thresholds h to the average of the scaled divergences over the $V_P^{(i)}$'s (the average of the $\hat{\lambda}_i$'s, denoted $\hat{\lambda}$) and assign query words with $\hat{\lambda} > h$ to the category of lexical semantic change.
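The pivot selection, resampling, percentile-based scaling, and thresholding described in the steps above can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not settle: the frequency threshold α is approximated by keeping a top fraction of the vocabulary, resamples are drawn without replacement (and truncated when the pivot set is smaller than M), percentiles use linear interpolation, and scaled values are clipped to [0, 1]. All function and parameter names are hypothetical.

```python
import random
from collections import Counter

def select_pivots(tokens_t1, tokens_t2, top_fraction=0.2, low=2 / 3, high=3 / 2):
    """Frequent words in both periods whose relative-frequency ratio stays in (low, high)."""
    c1, c2 = Counter(tokens_t1), Counter(tokens_t2)
    n1, n2 = sum(c1.values()), sum(c2.values())

    def top(counter):
        k = max(1, int(len(counter) * top_fraction))
        return {w for w, _ in counter.most_common(k)}

    frequent = top(c1) & top(c2)
    return sorted(w for w in frequent if low < (c1[w] / n1) / (c2[w] / n2) < high)

def draw_resamples(pivots, n_resamples=10, size=5000, seed=0):
    """Draw N resamples of (at most) M pivot words each, without replacement."""
    rng = random.Random(seed)
    size = min(size, len(pivots))
    return [rng.sample(pivots, size) for _ in range(n_resamples)]

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def scaling_bounds(explore_lambdas):
    """explore_lambdas holds K lists of divergences, one list per V_explore resample."""
    k = len(explore_lambdas)
    lower = sum(percentile(ls, 10) for ls in explore_lambdas) / k
    upper = sum(percentile(ls, 90) for ls in explore_lambdas) / k
    return lower, upper

def scale(lam, lower, upper):
    """Min-max scale a divergence and clip it to [0, 1] (clipping is an assumption)."""
    return min(1.0, max(0.0, (lam - lower) / (upper - lower)))

def has_changed(scaled_lambdas, h):
    """Binary decision: the mean scaled divergence exceeds the threshold h."""
    return sum(scaled_lambdas) / len(scaled_lambdas) > h
```

With the settings reported here, `draw_resamples` would be called with `n_resamples=10` and `size=5000`, and `scaling_bounds` would receive K = 5 lists of divergences.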
Evaluation: For binary detection, accuracy is used to compare the predicted lexical semantic changes with the given ground truth. For the ranking setting, we report both the Pearson correlation coefficient and Kendall-τ to measure the correspondence between the calculated divergences ($\lambda$'s) and the provided ground-truth scores.
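For concreteness, the three evaluation measures can be written out in plain Python as below. This is only a sketch: the Kendall-τ variant shown is the simple tau-a (tied pairs count as neither concordant nor discordant), which is an assumption, and it does not compute the p-values reported in the results. In practice, a statistics package such as SciPy (`scipy.stats.pearsonr`, `scipy.stats.kendalltau`) would typically be used instead.

```python
import math

def accuracy(gold, predicted):
    """Fraction of words whose predicted binary label matches the ground truth."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs divided by all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)
```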

Results
The results of EmbLexChange in the detection and the quantification of changes in the lexical semantics in English, German, Swedish, and Latin are provided in Table 2. After the competition, we had the chance to perform further optimizations of the hyperparameters leading to the current results, slightly improved from those submitted to the competition leaderboard. The EmbLexChange scores of the test set for all languages are available at http://language-lab.info/emblexchange/.
Binary detection: EmbLexChange detected the semantic changes in English, German, Swedish, and Latin with accuracies of 70.3%, 75.3%, 77.4%, and 60%, respectively. The selected h values, i.e., the thresholds used to assign the positive or the negative class for each language, are also provided in Table 2.
Ranking: The Kendall-τ p-values for English, German, and Swedish show that there is a significant correspondence between the EmbLexChange scores and the ground truth scores in those languages. The Pearson correlation is also calculated for all languages, with an average of 0.306 over four languages. The case of Latin has been more challenging in both binary and graded prediction of lexical semantic change.

Discussion and Conclusions
In this paper, we proposed EmbLexChange, a framework for the detection of lexical semantic changes in an unsupervised manner. We defined EmbLexChange as the divergence between the embedding-based profiles of a word w (calculated for a set of pivot words) in the source and the target domains (or between two time frames). By selecting pivot words through a resampling framework, we increase the reliability of these divergences. The underlying assumption of our method is that changes in the lexical semantics of word w affect its co-occurring words and subsequently alter its neighborhoods in the embedding spaces.
We showed that EmbLexChange can reliably detect lexical-semantic changes in English, German, Swedish, and Latin, achieving second place in the binary detection of semantic changes in SemEval-2020. The detection of semantic changes in Latin was more challenging than in the other languages. One reason may be the imbalance of embedding training instances between Latin $t_1$ and Latin $t_2$, as well as the overall smaller corpora for Latin in comparison to the other languages (shown in Table 1). Another reason may be the split of time frames, where $t_2$ in Latin spans a large period of 2000 years.
The overall SemEval results show that EmbLexChange performs better in the binary detection of semantic changes than in the ranking setting. However, we should note that the manual creation of a ranking ground truth is a much more challenging task than the creation of a binary classification ground truth. Thus, we believe that the classification results might be more reliable than those for the ranking.
EmbLexChange requires only the raw texts of the time frames/domains of interest. Semantic changes can then be detected based on the divergence between the embedding-based profiles of words in the source and the target domains. One advantage of using an embedding-based profile is that, by increasing the window size in the embedding training, we can move from syntactic changes toward semantic changes, which can be investigated in more depth as a future direction.