UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection

In this paper, we describe our method for the detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1 (binary change detection) and 4th in Sub-task 2 (ranked change detection). Our method is completely unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between the earlier and later spaces, using Canonical Correlation Analysis and orthogonal transformation; and measuring the cosine between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.


Introduction
Language evolves with time. New words appear, old words fall out of use, and the meanings of some words shift. The culture changes, as does the expected audience of the printed word. There are changes in topics, in syntax, and in presentation structure. Reading the natural-philosophy musings of aristocratic amateurs from the eighteenth century, and comparing them with a monograph from the nineteenth century or a medical study from the twentieth century, we can observe differences in many dimensions, some of which seem hard to study. Changes in word senses are both a visible and a tractable part of language evolution. Computational methods for tracing the stories of words have the potential to help us understand this small corner of linguistic evolution. The tools for measuring these diachronic semantic shifts might also be useful for measuring whether the same word is used in different ways in synchronic documents.

The task of finding word sense changes over time is called diachronic Lexical Semantic Change (LSC) detection. The task has received growing attention in recent years (Hamilton et al., 2016b; Frermann and Lapata, 2016; Schlechtweg et al., 2017). There is also a synchronic LSC task, which aims to identify domain-specific changes of word senses compared to general-language usage (Schlechtweg et al., 2019). Tahmasebi et al. (2018) provide a comprehensive survey of techniques for the LSC task, as do Kutuzov et al. (2018). Schlechtweg et al. (2019) evaluated available approaches for LSC detection using the DURel dataset (Schlechtweg et al., 2018). Some of the methodologies for finding time-sensitive meanings were originally borrowed from information retrieval. According to Schlechtweg et al. (2019), there are mainly three types of approaches.
(1) Semantic vector space approaches (Gulordava and Baroni, 2011; Kim et al., 2014; Xu and Kemp, 2015; Eger and Mehler, 2016; Hamilton et al., 2016a; Hamilton et al., 2016b; Rosenfeld and Erk, 2018) represent each word with two vectors for two different time periods. The change of meaning is then measured by the cosine distance between the two vectors. (2) Topic modeling approaches (Wang and McCallum, 2006; Bamman and Crane, 2011; Wijaya and Yeniterzi, 2011; Mihalcea and Nastase, 2012; Cook et al., 2014; Frermann and Lapata, 2016; Schlechtweg and Walde, 2020) estimate a probability distribution of words over their different senses, i.e., topics. (3) Clustering models (Mitra et al., 2015; Tahmasebi and Risse, 2017) group word usages into clusters representing different senses.
We participated in the SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection competition. In this paper, we describe our solution and submitted systems for this competition. The task consists of two sub-tasks, a binary classification task (Sub-task 1) and a ranking task (Sub-task 2), which involve comparing the usage of target words between two lemmatized corpora, each drawn from documents from a different time period, for four languages: English, German, Latin, and Swedish. For both sub-tasks, only the target words and the two corpora for each language were provided by the organizers; there was no annotated data. The task is intended to be solved in a completely unsupervised way.
In the binary classification task, the goal is, for two given corpora C1 and C2 (for times t1 and t2) and for a set of target words, to decide which of these words did or did not change their sense between t1 and t2. A word counts as changed if it lost or gained at least one sense between the two periods (corpora). The objective of Sub-task 2 is, for two given corpora C1 and C2, to rank a set of target words according to their degree of lexical semantic change between t1 and t2, where a higher rank means a stronger change. The target words are the same for both sub-tasks.
Because language evolves, the expressions, words, and sentence constructions in two corpora from different time periods about the same topic will be written in languages that are quite similar but slightly different. The corpora will share the majority of their words, grammar, and syntax.
The main idea behind our solution is that we treat each pair of corpora C1 and C2 as different languages L1 and L2, even though the text in both corpora is written in the same language. We expect these two languages L1 and L2 to be extremely similar in all aspects, including semantics. We train a separate semantic space for each corpus and subsequently map these two spaces into one common cross-lingual space. We use methods for cross-lingual mapping (Brychcín et al., 2019; Artetxe et al., 2016; Artetxe et al., 2017; Artetxe et al., 2018a; Artetxe et al., 2018b), and thanks to the high similarity between L1 and L2, the quality of the transformation should be high. We compute cosine similarity to classify and rank the target words; see Section 3 for details.
Our systems ranked 1st out of 33 teams in Sub-task 1, with an average accuracy of 0.687, and 4th out of 32 teams in Sub-task 2, with an average Spearman's rank correlation of 0.481.

Data
The corpora are drawn from several sources. Table 1 shows periods and sizes. For each language, items in the earlier corpus are separated from items in the later one by at least 46 years (German) and as much as 2200 years (Latin). All corpora are lemmatized, punctuation is removed, and sentences are randomly reordered. For English, target words have been marked with their part-of-speech. Two example sentences: (1) "there be no pleasing any of you do as we may", and (2) "rise upon the instant in his stirrup the bold cavalier hurl with a sure and steady hand the discharge weapon in the face nn of his next opponent". Sentence (1) illustrates a failure of lemmatization: the word 'pleasing' is a form of the verb 'please'. Sentence (2) shows the target word 'face', marked as a noun. Less than 10% of the English sentences contain a target word. Lemmatization reduces the vocabulary, so there are more examples of each word. It also introduces ambiguity; the decisions to add a POS tag to English target words and to retain German noun capitalization show that the organizers were aware of this problem.

System Description
First, we train two semantic spaces from corpora C1 and C2. We represent the semantic spaces by a matrix X_s (i.e., the source space s) and a matrix X_t (i.e., the target space t), trained using word2vec Skip-gram with negative sampling (Mikolov et al., 2013). We then perform a cross-lingual mapping of the two vector spaces, getting two matrices X̂_s and X̂_t projected into a shared space. We select two methods for the cross-lingual mapping: Canonical Correlation Analysis (CCA), using the implementation from (Brychcín et al., 2019), and a modification of the Orthogonal Transformation from VecMap (Artetxe et al., 2018b). Both of these methods are linear transformations. In our case, the transformation can be written as follows:

X̂_s = X_s W_{s→t},

where W_{s→t} is a matrix that performs a linear transformation from the source space s (matrix X_s) into the target space t, and X̂_s is the source space transformed into the target space t (the matrix X_t does not have to be transformed because X_t is already in the target space t, so X̂_t = X_t). Generally, the CCA transformation maps both spaces X_s and X_t into a third shared space o, computing two transformation matrices: W_{s→o} for the source space and W_{t→o} for the target space. The transformation matrices are computed by minimizing the negative correlation between the vectors x_i^s ∈ X_s and x_i^t ∈ X_t projected into the shared space o. The negative correlation is defined as:

−ρ(X_s W_{s→o}, X_t W_{t→o}) = − (1/n) Σ_{i=1}^{n} cov(x̂_i^s, x̂_i^t) / √( var(x̂_i^s) var(x̂_i^t) ),

where cov is the covariance, var is the variance, and n is the number of vectors. In our implementation of CCA, the matrix X̂_t is equal to the matrix X_t, because the implementation transforms only the source space s (matrix X_s) into the target space t via the common shared space with a pseudo-inversion, and the target space does not change.
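To make the CCA pipeline concrete, the following NumPy sketch reconstructs the whole chain described above: whiten both spaces, align them in the shared space o via an SVD of the whitened cross-covariance, then map the source into the target space with a pseudo-inverse. This is an illustrative reconstruction under our own simplifying assumptions, not the implementation of Brychcín et al. (2019); the eps regularizer is our own addition for numerical stability.

```python
import numpy as np

def cca_transform(X_s, X_t):
    """Linear map W_{s->t} from the source to the target space via CCA.
    Rows of X_s and X_t are the vectors of matching dictionary words."""
    # Center both matrices of paired word vectors.
    Xs = X_s - X_s.mean(axis=0)
    Xt = X_t - X_t.mean(axis=0)
    n = Xs.shape[0]
    eps = 1e-8  # small ridge for numerical stability (our own addition)
    Css = Xs.T @ Xs / n + eps * np.eye(Xs.shape[1])
    Ctt = Xt.T @ Xt / n + eps * np.eye(Xt.shape[1])
    Cst = Xs.T @ Xt / n

    def inv_sqrt(C):
        # Inverse matrix square root C^{-1/2} via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Ws_half, Wt_half = inv_sqrt(Css), inv_sqrt(Ctt)
    # Canonical directions from the SVD of the whitened cross-covariance.
    U, _, Vt = np.linalg.svd(Ws_half @ Cst @ Wt_half)
    W_s_to_o = Ws_half @ U      # maps source space s into shared space o
    W_t_to_o = Wt_half @ Vt.T   # maps target space t into shared space o
    # Source -> shared -> target, using a pseudo-inverse for the way back.
    return W_s_to_o @ np.linalg.pinv(W_t_to_o)
```

On synthetic data where the target space is an exact rotation of the source space, this recovers the rotation, since the correlation of the paired vectors is then perfect.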
The matrix W_{s→t} for this transformation is then given by:

W_{s→t} = W_{s→o} (W_{t→o})^{−1},

where (W_{t→o})^{−1} denotes the pseudo-inverse. The submissions that use CCA are referred to as cca-nn, cca-bin, cca-nn-r, and cca-bin-r, where the -r suffix means that the source and target spaces are reversed; see Section 4. The -nn and -bin suffixes refer to the type of threshold, which is used only in Sub-task 1; see Section 3.1. Thus, in Sub-task 2, there is no difference between the following pairs of submissions: cca-nn and cca-bin, and cca-nn-r and cca-bin-r.
In the case of the Orthogonal Transformation, the submissions are referred to as ort and uns. For ort, we use the Orthogonal Transformation with a supervised seed dictionary consisting of all words common to both semantic spaces. The transformation matrix W_{s→t} is given by:

W_{s→t} = argmin_W Σ_{w_i ∈ V} ‖x_i^s W − x_i^t‖²,

under the hard condition that W_{s→t} must be orthogonal, where V is the vocabulary of correct word translations from the source to the target space. The reason for the orthogonality constraint is that a linear transformation with an orthogonal matrix does not squeeze or re-scale the transformed space. It only rotates the space, and thus preserves most of the relationships between its elements (in our case, it is important that the orthogonal transformation preserves the angles between words, so it preserves the cosine similarity). Artetxe et al. (2018b) also proposed a method for automatic dictionary induction, a fully unsupervised method for finding orthogonal cross-lingual transformations. We used this approach for our uns submissions.
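This constrained least-squares problem is the classic orthogonal Procrustes problem, whose closed-form solution uses a single SVD. The sketch below is a minimal NumPy illustration of that solution, not the modified VecMap implementation we submitted:

```python
import numpy as np

def orthogonal_map(X_s, X_t):
    """Closed-form solution of min_W ||X_s W - X_t||_F with W orthogonal:
    W = U V^T, where U S V^T is the SVD of X_s^T X_t.  Rows of X_s and
    X_t are the vectors of matching seed-dictionary words."""
    U, _, Vt = np.linalg.svd(X_s.T @ X_t)
    return U @ Vt
```

Because W is orthogonal, it only rotates the source space, so the cosine similarities between words within the space are unchanged, exactly as the constraint above requires.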
Finally, in all transformation methods, for each word w_i from the set of target words T, we select its corresponding vectors v_{w_i}^s and v_{w_i}^t from the matrices X̂_s and X̂_t, respectively (v_{w_i}^s ∈ X̂_s and v_{w_i}^t ∈ X̂_t), and we compute the cosine similarity between these two vectors. The cosine similarity is then used to generate the output for each sub-task. For Sub-task 2, we compute the degree of change for word w_i as 1 − cos(v_{w_i}^s, v_{w_i}^t). The settings of hyper-parameters for both methods give us several combinations; see Sections 4 and 5.
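As a concrete illustration of this scoring step, a few lines of NumPy compute the cosine similarity and the Sub-task 2 degree of change (the function names are our own; this is not the submission code):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def degree_of_change(v_s, v_t):
    """Sub-task 2 score: 1 - cosine of the two mapped vectors of a target word."""
    return 1.0 - cosine(v_s, v_t)
```

Identical directions give a degree of change of 0, orthogonal vectors give 1, and opposite vectors give 2, so the score grows with the strength of the semantic shift.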

Binary System
The organizers provided a definition of binary change in terms of specific numbers of usages of the senses of a target word. We decided against attempting to model and group individual word usages. Instead, we use the continuous scores from Sub-task 2 for the binary task. We assume that there is a threshold t such that target words with a continuous score greater than t changed meaning, and words with a lower score did not. We know that this assumption is not strictly correct (the threshold introduces some error into the classification), but we believe it holds for most cases and is the best available choice. To examine this assumption after the evaluation period, we computed the accuracy of the gold ranking scores with the optimal threshold (selected to maximize test accuracy). As Table 2 shows, even if an optimal threshold is chosen, the best accuracy that can be achieved is, on average, 87.6%.

Table 2: Sub-task 1 accuracy with gold ranking and optimal thresholds.
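The post-hoc analysis behind Table 2 amounts to a simple search over candidate thresholds. The sketch below is our own illustrative reconstruction (the function name and the 1 = changed label convention are assumptions), not the official scorer:

```python
def best_threshold_accuracy(scores, labels):
    """Accuracy of the best possible threshold on continuous change scores.
    labels: gold binary labels (1 = changed, 0 = unchanged); a word is
    predicted 'changed' when its score exceeds the threshold."""
    best = 0.0
    # Try a threshold below every score, plus each score itself.
    for t in [min(scores) - 1.0] + sorted(scores):
        preds = [1 if s > t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc)
    return best
```

When the gold ranking separates the two classes perfectly, this upper bound is 1.0; Table 2 shows that on the real gold scores it tops out at 87.6% on average.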
To find the threshold t, we tried several approaches. We call the first approach binary-threshold (-bin in Table 3). For each target word w_i, we compute the cosine similarity of its vectors v_{w_i}^s and v_{w_i}^t, and then we average these similarities over all target words. The resulting average is used as the threshold. Another approach, called global-threshold (-gl), works similarly, but the average similarity is computed across all four languages. The last approach, called nearest-neighbors (-nn), compares sets of nearest neighbours. For a target word w_i, we find the 100 nearest (most similar) words in both transformed spaces (matrices X̂_s and X̂_t), getting two sets of nearest neighbours, N_s and N_t, for each target word. Then we compute the size of the intersection of these two sets for each target word. From the array of intersection sizes, we select the second-highest value and divide it by two. The resulting number is the threshold for all target words. If the size of a target word's intersection is greater than or equal to the threshold, we classify the word as unchanged; otherwise as changed. This threshold is set for each language independently.
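The nearest-neighbors decision rule above can be sketched in a few lines of Python, given precomputed neighbour sets (the function name and input format are our own illustrative choices):

```python
def nearest_neighbors_decision(neigh_s, neigh_t):
    """Nearest-neighbors (-nn) threshold sketch.  neigh_s[w] and neigh_t[w]
    are the sets of the 100 nearest words of target word w in the two
    transformed spaces.  Returns 0 for 'unchanged', 1 for 'changed'."""
    # Size of the neighbour-set intersection for every target word.
    sizes = {w: len(neigh_s[w] & neigh_t[w]) for w in neigh_s}
    # Threshold: half of the second-highest intersection size.
    second_highest = sorted(sizes.values(), reverse=True)[1]
    threshold = second_highest / 2
    return {w: 0 if size >= threshold else 1 for w, size in sizes.items()}
```

A word whose neighbourhoods barely overlap across the two spaces falls below the threshold and is classified as changed.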

Experimental Setup
To obtain the semantic spaces, we employ Skip-gram with negative sampling (Mikolov et al., 2013), trained with the Gensim framework (Řehůřek and Sojka, 2010). For the final submission, we trained the semantic spaces with 100 dimensions for five iterations, with five negative samples and the window size set to five. Each word has to appear at least five times in the corpus to be used in the training. For all cca- submissions, we build the translation dictionary for the cross-lingual transformation of the two spaces by taking the intersection of their vocabularies and removing the target words.
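For concreteness, the dictionary construction for the cca- runs can be sketched as follows (a trivial illustration; the function name is ours):

```python
def seed_dictionary(vocab_s, vocab_t, target_words):
    """All words shared by both vocabularies, minus the target words,
    so that the words under investigation cannot anchor the mapping."""
    return sorted((set(vocab_s) & set(vocab_t)) - set(target_words))
```

Removing the target words matters: if a target word itself were a fixed anchor of the transformation, its measured shift between the two spaces would be artificially suppressed.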
For cca-nn-r and cca-bin-r, we change the direction of the cross-lingual transformation. The initial setup is that the source space is the space of the earlier corpus C1 (represented by the matrix X_s), and the target space is the semantic space of the later corpus C2 (represented by the matrix X_t). In the reversed setup, we use the matrix X_t as the source space, which is transformed into the semantic space (matrix X_s) of the earlier corpus, i.e., into the original source space.
The two other methods are Orthogonal Transformations with identical words as the seed dictionary (ort) and the unsupervised transformation (uns) from the VecMap tool. In these methods, we use the median similarity as the threshold for the binary task. We experimented with a separate threshold for each language (-bin suffix) and with a single global threshold for all languages (-gl suffix).
The sample data provided with the task did not contain any labeled development target words, so we used the DURel (Schlechtweg et al., 2018) and WOCC (Schlechtweg et al., 2019) corpora to validate our solutions' performance. The WOCC corpus contains lemmatized German sentences from the Deutsches Textarchiv (Deutsches Textarchiv, 2017) for two time periods, 1750-1799 and 1850-1899, which we used to train the semantic spaces. We evaluated our system on the DURel corpus and achieved a Spearman's rank correlation of 0.78 with settings that correspond to the column named cca-nn-r in Table 3.

Results
We submitted eight different submissions for Sub-task 1 and four for Sub-task 2, obtained with CCA and the VecMap tool. We also submitted predictions based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003), but because of its poor results and the limited paper size, we do not describe them here. The results, along with the rankings, are shown in Table 3. Bold results denote the best result we achieved for each language, and underlined results were used in our team's final ranking.
The cca-bin-r system settings achieved the first-place rank out of 189 submissions by 33 teams on Sub-task 1, with an absolute accuracy of 0.687. The majority class was unchanged, and always choosing it would have given a score of 0.571. The submissions with the top 15 best accuracies (four teams) had scores ranging from 0.659 to 0.687, with assorted scores on the individual languages. The four of our systems using the -bin approach, which included the top-scoring system, have a mean percentile of 94.6; the two -gl systems have a mean percentile of 92.7, and the two -nn systems have a mean percentile of 75.5. On Sub-task 2, our best system achieved the fourth-place rank out of 33 teams, with an average correlation against the gold ranking over all four languages of 0.481.
Since the threshold strategy was used only in Sub-task 1, there is no difference in the Sub-task 2 results in Table 3 between the following pairs of columns: cca-nn and cca-bin, cca-nn-r and cca-bin-r, ort-bin and ort-gl, and uns-bin and uns-gl. Thus, for Sub-task 2 we provide numbers only in the cca-nn, cca-nn-r, ort-bin, and uns-bin columns.
Table 3 shows that the ranking scores for Latin are not only worse than those for the other languages in absolute terms; their position relative to other submissions is also much worse. The Latin corpora have several anomalies, but only the small size of the earlier corpus (a third the size of the next smallest corpus) seems likely to be a problem. For example, although both Latin corpora have a much larger proportion of short lines than the others, a measurement of the mean context size for target words shows that for all of the corpora, it lies between 8.38 (German 2) and 8.95 (German 1).
A different problem is syntax. The lemmatization of the corpora is no obstacle to the English reader, who can usually guess the correct form, because word-order in English (and also in Swedish and German) is somewhat rigid, and word inflexions are largely replaced by separate words, which have their own lemmas. In Latin, word-order is quite flexible, and different authors have different preferred schemes for emphasis and euphony. Two adjacent words are not required to be part of each other's context, and the most semantically related words in a sentence may be separated precisely to emphasize them.
We performed post-evaluation experiments with the word embedding vector size in order to discover its effect on system performance; Figure 1 visualizes the results for Sub-task 2. It shows that the optimal size is in most cases between 100 and 175 for Sub-task 2, and between 75 and 125 for Sub-task 1 (not shown here). We also tried using fastText (Bojanowski et al., 2017) instead of word2vec Skip-gram to obtain the semantic spaces, with settings corresponding to the ones used for the final submissions. In general, fastText did not improve performance, but for some settings we obtained slightly better results. Other experiments suggest that the Latin results can benefit from an increased context size during the training of the semantic spaces. According to the submitted results, the CCA method with reversed order (columns cca-nn-r and cca-bin-r in Table 3) seems to work better than the non-reversed variant, but Figure 1 shows that this holds only for English and Latin.

Conclusion
Applying a threshold to a semantic distance is a sensible architecture for detecting binary semantic change in target words between two corpora, and our binary-threshold strategy succeeded quite well. We did have a small advantage in that the list of target words turned out to be nearly equally divided between unchanged (0.57) and changed (0.43) words, so choosing thresholds under the assumption of a 50/50 split was not a severe problem. Our experiments also reveal the limits of a threshold strategy, as shown in Table 2. Although our systems did not win the Sub-task 2 change-ranking competition, they show that our architecture is a strong contender for this task when there is sufficient data to build good semantic vector spaces. The results for Latin illustrate that there is still room for other, less statistical approaches; one might have predicted, falsely, that 1.7M tokens would be plenty of data. The variety of experiments which were possible in the post-evaluation period suggests that the corpora developed for the task will encourage further work in the area. In future work, we plan to focus on other cross-lingual techniques and on other methods of measuring similarity between words besides cosine similarity.