SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.


Overview
Recent years have seen an exponentially rising interest in computational Lexical Semantic Change (LSC) detection (Tahmasebi et al., 2018; Kutuzov et al., 2018). However, the field lacks standard evaluation tasks and data. Almost all papers differ in how the evaluation is performed and in what factors are considered in it. Very few models are evaluated on a manually annotated diachronic corpus (e.g., Perrone et al., 2019; Schlechtweg et al., 2019). This hampers the development of computational models for LSC and is a barrier to high-quality, comparable results that can be used in follow-up tasks.
We report the results of the first SemEval shared task on Unsupervised LSC detection. 1 We introduce two related subtasks for computational LSC detection, which aim to identify the change in meaning of words over time using corpus data. We provide a high-quality multilingual (English, German, Latin, Swedish) LSC gold standard relying on approximately 100,000 instances of human judgment. For the first time, it is possible to compare the variety of proposed models on relatively solid grounds and across languages, and to put previously reached conclusions on trial. We may now provide answers to questions concerning the performance of different types of semantic representations (such as token embeddings vs. type embeddings, and topic models vs. vector space models), alignment methods and change measures. We provide a thorough analysis of the submitted results uncovering trends for models and opening perspectives for further improvements. In addition to this, the CodaLab website will remain open to allow any reader to directly and easily compare their results to the participating systems. We expect the long-term impact of the task to be significant, and hope to encourage the study of LSC in more languages than are currently studied, in particular less-resourced languages.

Subtasks
For the proposed tasks we rely on the comparison of two time-specific corpora C1 and C2. While this simplifies the LSC detection problem, it has two main advantages: (i) it reduces the number of time periods for which data has to be annotated, so we can annotate larger corpus samples and hence more reliably represent the sense distributions of target words; (ii) it reduces the task complexity, allowing different model architectures to be applied to it, widening the range of possible participants.

          C1                            C2
Senses    chamber   biology   phone     chamber   biology   phone
# uses    12        18        0         4         11        18
Table 1: An example of a sense frequency distribution for the word cell in C1 and C2.

Participants were asked to solve two subtasks:
Subtask 1 Binary classification: for a set of target words, decide which words lost or gained sense(s) between C1 and C2, and which ones did not.
Subtask 2 Ranking: rank a set of target words according to their degree of LSC between C1 and C2.
For Subtask 1, consider the example of cell in Table 1, where the sense 'phone' is newly acquired from C 1 to C 2 because its frequency is 0 in C 1 and > 0 in C 2 . Subtask 2, instead, captures fine-grained changes in the two sense frequency distributions. For example, Table 1 shows that the frequency of the sense 'chamber' drops from C 1 to C 2 , although it is not totally lost. Such a change will increase the degree of LSC for Subtask 2, but will not count as change in Subtask 1. The notion of LSC underlying Subtask 1 is most relevant to historical linguistics and lexicography, while the majority of LSC detection models are rather designed to solve Subtask 2. Hence, we expected Subtask 1 to be a challenge for most models. Knowing whether, and to what degree a word has changed is crucial in other tasks, e.g. aiding in understanding historical documents, searching for relevant content, or historical sentiment analysis. The full LSC problem can be seen as a generalization of these two tasks into multiple time points where also the type of change needs to be identified.

Data
The task took place in a realistic unsupervised learning scenario. Participants were provided with trial and test data, but no training data. The public trial and test data consisted of a diachronic corpus pair and a set of target words for each language. Participants' predictions were evaluated against a set of hidden gold labels. The trial data consisted of small samples from the test corpora (see below) and four target words per language to which we assigned binary and graded gold labels randomly. Participants could not use this data to develop their models, but only to test the data input format and the online submission format. For development data participants were referred to three pre-existing diachronic data sets: DURel (Schlechtweg et al., 2018), SemCor LSC (Schlechtweg and Schulte im Walde, 2020) and WSC (Tahmasebi and Risse, 2017). In the evaluation phase participants were provided with the test corpora and a set of target words for each language. 2 Participants were asked to train their models only on the corpora described in Table 2, though the use of pre-trained embeddings was allowed as long as they were trained in a completely unsupervised way, i.e., not on manually annotated data.

Corpora
For English, we used the Clean Corpus of Historical American English (CCOHA) (Davies, 2012; Alatrash et al., 2020), which spans the 1810s-2000s. For German, we used the DTA corpus (Deutsches Textarchiv, 2017) and a combination of the BZ and ND corpora (Berliner Zeitung, 2018; Neues Deutschland, 2018). DTA contains texts from different genres spanning the 16th-20th centuries. BZ and ND are newspaper corpora jointly spanning 1945-1993. For Latin, we used the LatinISE corpus (McGillivray and Kilgarriff, 2013) spanning from the 2nd century B.C. to the 21st century A.D. For Swedish, we used the Kubhist corpus (Språkbanken, Downloaded in 2019), a newspaper corpus containing texts from the 18th-20th centuries. The corpora are lemmatised and POS-tagged. CCOHA and DTA are spelling-normalized. BZ, ND and Kubhist contain frequent OCR errors (Adesam et al., 2019; Hengchen et al., to appear).
From each corpus we extracted two time-specific subcorpora C1, C2, as defined in Table 2. The division was driven by considerations of data size and availability of target words (see below). From these two subcorpora we then sampled the released test corpora in the following way: Sentences with < 10 tokens (< 2 for Latin) were removed. German C2 was downsampled to fit the size of C1 by sampling all sentences containing target lemmas and combining them with a random sample, of suitable size, of sentences not containing target lemmas. The same procedure was applied to downsample English C1 and C2. For Latin and Swedish the full set of sentences was used. Finally, all tokens were replaced by their lemma, punctuation was removed and sentences were randomly shuffled within each of C1, C2. 3 A summary of the released test corpora is given in Table 2.

Target words
Target words are either: (i) words that changed their meaning(s) (lost or gained a sense) between C 1 and C 2 ; or (ii) stable words that did not change their meaning during that time. 4 A large list of 100-200 changing words was selected by scanning etymological and historical dictionaries (Paul, 2002;Svenska Akademien, 2009;OED, 2009) for changes within the time periods of the respective corpora. This list was then further reduced by one annotator who checked whether there were meaning differences in samples of 50 uses from C 1 and C 2 per target word. Stable words were then chosen by sampling a control counterpart for each of the changing words with the same POS and comparable frequency development between C 1 and C 2 , and manually verifying their diachronic stability as described above. Both types of words were annotated to obtain their sense frequency distributions as described below, which allowed us to verify the a-priori choice of changing and stable words. By balancing the target words for POS and frequency we aim to minimize the possibility that model biases towards these factors lead to artificially high performance (Dubossarsky et al., 2017; Schulte im Walde, 2020).

Hidden/True Labels
For Subtask 1 (binary classification) each target word was assigned a binary label (l ∈ {0, 1}) via manual annotation (0 for stable, 1 for change). For Subtask 2 each target word was assigned a graded label (0 ≤ l ≤ 1) according to their degree of LSC derived from the annotation (0 means no change, 1 means total change). The hidden labels were published in the post-evaluation phase. 5 Both types of labels (binary and graded) were derived from the sense frequency distributions of target words in C 1 and C 2 as obtained from the annotation process. For this, we adopt change notions similar to Schlechtweg and Schulte im Walde (2020) as described below.

Annotation
We focused our efforts on annotating large and more representative samples for a limited number of words rather than annotating many words. 6 In this section we first describe the setup of the annotation for the modern languages (English, German, and Swedish). The setup for Latin is slightly different and we describe it later in this section.

3 Sentence shuffling and lemmatization were done for copyright reasons. Participants were provided with start and end positions of sentences. Where Kubhist did not provide lemmatization (through KORP (Borin et al., 2012)) we left tokens unlemmatized. Additional pre-processing steps were needed for English: for copyright reasons CCOHA contains frequent replacement tokens (10 x '@'). We split sentences around replacement tokens and removed them as a first step in the preprocessing pipeline. Further, because English frequently combines various POS in one lemma and many of our target words underwent POS-specific semantic changes, we concatenated targets in the English corpus with their broad POS tag ('target pos'). Also, the joint size of the CCOHA subcorpora had to be limited to ∼10M tokens because of copyright issues.
4 A target word is represented by its lemma form.
5 https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd-post
6 An indication that random samples with the chosen sizes can indeed be expected to be representative of the population is given by the results of the simulation study described in Appendix A: We were able to nearly fully recover the population clustering structure from the samples (average of > .96 adjusted mean rand index).
We started with four annotators per language, but had to add additional annotators later because of a high annotation load and dropouts. The total number of annotators for English/German/Swedish was 9/8/5. All annotators were native speakers and present or former university students. For German we had two annotators with a background in historical linguistics, while for English and Swedish we had one such annotator. For each target word we randomly sampled 100 uses from each of C 1 and C 2 for annotation (total of 200 uses per target word). 7 If a target word had less than 100 uses, we annotated the full sample. We then mixed the use samples of a target word into a joint set U and annotated U using an extension of the DURel framework (Schlechtweg et al., 2018;Erk et al., 2013). DURel produces high inter-annotator agreement even between non-expert annotators relying on the simple notion of semantic relatedness. Pairs of word uses from C 1 and C 2 are annotated on a four-point scale from unrelated meanings (1) to identical meanings (4) (see Table 3). Our extension consisted in the sampling procedure of use pairs: instead of annotating a random sample of pairs and using comparison of their mean relatedness over time as a measure of LSC (Schlechtweg et al., 2018), we aimed to sample pairs such that after annotation they span a sparsely connected usage graph combining the uses from C 1 , C 2 , where nodes represent uses and edges represent (the median of) annotator judgments (see Figure 1). This usage graph was then clustered into sets of uses expressing the same sense (Schütze, 1998). By further distinguishing two subgraphs for C 1 , C 2 we got two clusterings with a shared set of clusters, because they were obtained on the same total graph (Palla et al., 2007). We then equated the two clusterings obtained for C 1 , C 2 with their respective sense frequency distributions D 1 , D 2 . The change scores followed immediately (see below). 
Note that this extension remained hidden from the annotators: as with DURel their only task was to judge the relatedness of use pairs. These were presented to annotators in randomized order.
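The path from raw judgments to sense frequency distributions can be illustrated with a minimal sketch. It assumes that judgments arrive as (use, use, score) triples and that a clustering of the uses is already given; the function names, the toy uses and the toy clusters are hypothetical and are not taken from the task's annotation pipeline.

```python
from collections import Counter
from statistics import median


def build_usage_graph(judgments):
    """Map each use pair to the median of its annotator judgments (1-4 scale)."""
    by_pair = {}
    for use_a, use_b, score in judgments:
        pair = frozenset((use_a, use_b))  # edges are undirected
        by_pair.setdefault(pair, []).append(score)
    return {pair: median(scores) for pair, scores in by_pair.items()}


def sense_frequency_distributions(clusters, period_of):
    """Count, for every cluster (sense), its uses in C1 and in C2."""
    d1, d2 = Counter(), Counter()
    for sense, uses in clusters.items():
        for use in uses:
            if period_of[use] == 1:
                d1[sense] += 1
            else:
                d2[sense] += 1
    senses = sorted(clusters)
    return [d1[s] for s in senses], [d2[s] for s in senses]


# Hypothetical toy data: three uses, some pairs judged by two annotators.
judgments = [("u1", "u2", 4), ("u1", "u2", 3), ("u1", "u3", 1),
             ("u2", "u3", 2), ("u2", "u3", 1)]
graph = build_usage_graph(judgments)

# Hypothetical clustering and time periods (1 = C1, 2 = C2) of the uses.
clusters = {"sense_a": ["u1", "u2"], "sense_b": ["u3"]}
period_of = {"u1": 1, "u2": 2, "u3": 2}
D1, D2 = sense_frequency_distributions(clusters, period_of)
```

Both distributions are indexed by the same list of clusters, which mirrors the point made above that the two clusterings share one set of clusters because they come from the same total graph.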

Edge sampling
Retrieving the full usage graph is not feasible even for a small set of n uses as this implies annotating n * (n − 1)/2 edges. Hence, the main challenge with our annotation approach was to reduce the number of edges to annotate as much as possible, while keeping the necessary information needed to infer a meaningful clustering on the graph. We did this by annotating the data in several rounds. After each round the usage graph of a target word was updated with the new annotations and a new clustering was obtained. 8 Based on this clustering we sampled the edges for the next round applying simple heuristics similar to Biemann (2013), a detailed description of which can be found in Appendix A. We spread the annotation load randomly over annotators making sure that roughly half of the use pairs were annotated by more than one annotator.

Special Treatment of Latin
Latin poses a special case due to the lack of native speakers. We recruited 10 annotators with a high-level knowledge of Latin, ranging from undergraduate students to PhD students, post-doctoral researchers, and more senior researchers. We selected a range of target words whose meaning had changed between the pre-Christian and the Christian era according to the literature (Clackson, 2011) and in the pre-annotation trial we checked that both meanings were present in the corpus data. For each changed word, we selected a control word whose meaning did not change between the pre-Christian era and the Christian era, whose PoS was the same as that of the changed word, and whose frequency values in each of the two subcorpora (f_cc1 and f_cc2) were in the following intervals:
$$ f_{cc_1} \in [\, f_{tc_1} - p \cdot f_{tc_1},\; f_{tc_1} + p \cdot f_{tc_1} \,] \quad \text{and} \quad f_{cc_2} \in [\, f_{tc_2} - p \cdot f_{tc_2},\; f_{tc_2} + p \cdot f_{tc_2} \,] $$
respectively, where p ranged between 0.03 and 0.15 and f_tc1 and f_tc2 are the frequencies of the changed word in C1, C2. 9 In a trial annotation task our annotators reported difficulties, and said that they had to translate into their native language when comparing two excerpts of text. Hence, we decided to use a variation of the procedure described above which was introduced by Erk et al. (2013). Instead of use pairs, annotators judged the relatedness between a use and a sense definition from a dictionary, on the DURel scale. The sense definitions were taken from the Latin portion of the Logeion online dictionary. 10 We selected 30 sample sentences for each of C1, C2. Due to the challenge of finding qualified annotators, each word was assigned to only one annotator. We treated sense definitions as additional nodes in a usage graph connected to uses by edges representing annotator judgments. Clustering was then performed as for the other languages.
Table 4: Overview of target words. n = number of target words, N/V/A = number of nouns/verbs/adjectives, AGR = inter-annotator agreement in round 1, LOSS = mean of normalized clustering loss * 10, JUD = number of judged use pairs, LSC = mean binary/graded change score, FRQ_d = Spearman correlation between change scores and target words' absolute difference in log-frequency between C1, C2. Similarly for minimum frequency (FRQ_m) and minimum number of senses (PLY_m) across C1, C2.

Clustering
The usage graphs we obtain from the annotation are weighted, undirected, sparsely observed and noisy. This poses a very specific problem that calls for a robust clustering algorithm. For this, we rely on a variation of correlation clustering (Bansal et al., 2004) by minimizing the sum of cluster disagreements, i.e., the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters. To see this, consider Blank's (1997) continuum of semantic proximity and the DURel relatedness scale derived from it, as illustrated in Table 3. In line with Blank, we assume that use pairs with judgments of 3 and 4 are more likely to belong to the same sense, while judgments of 1 and 2 are more likely to belong to different senses. Consequently, we shift the weight W(e) of all edges e ∈ E in a usage graph G = (U, E, W) to W(e) − 2.5. We refer to those edges e ∈ E with a weight W(e) ≥ 0 as positive edges P_E and to edges with weights W(e) < 0 as negative edges N_E. Let further C be some clustering on U, φ_{E,C} be the set of positive edges across any of the clusters in clustering C and ψ_{E,C} the set of negative edges within any of the clusters. We then search for a clustering C that minimizes L(C):
$$ L(C) = \sum_{e \in \phi_{E,C}} W(e) + \sum_{e \in \psi_{E,C}} |W(e)| \quad (1) $$
That is, we try to minimize the sum of positive edge weights between clusters and (absolute) negative edge weights within clusters. Minimizing L is a discrete optimization problem which is NP-hard (Bansal et al., 2004). However, we have a relatively low number of nodes (≤ 200), and hence, the global optimum can be approximated sufficiently with a standard optimization algorithm. We choose Simulated Annealing (Pincus, 1970) as we do not have strong efficiency constraints and the algorithm showed superior performance in a simulation study. More details on the procedure can be found in Appendix A. In order to reduce the search space, we iterate over different values for the maximum number of clusters.
We also iterate over randomly as well as heuristically chosen initial clustering states. 11 This way of clustering usage graphs has several advantages: (i) It finds the optimal number of clusters on its own. (ii) It easily handles missing information (non-observed edges). (iii) It is robust to errors by using the global information on the graph. That is, a wrong judgment can be outweighed by correct ones. (iv) It directly optimizes an intuitive quality criterion on usage graphs. Many other clustering algorithms such as Chinese Whispers (Biemann, 2006) make local decisions, so that the final solution is not guaranteed to optimize a global criterion such as L. (v) By weighing each edge with its (shifted) weight, L respects the gradedness of word meaning. That is, edges with |W (e)| ≈ 0 have less influence on L than edges with |W (e)| ≈ 1.5. Finally, it showed superior performance to all other clustering algorithms we tested in a simulation study. (See Appendix A.)
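The clustering criterion can be made concrete with a minimal sketch. The loss function follows the description of L(C) given above (positive shifted weights across clusters plus absolute negative shifted weights within clusters). The annealing loop is only an illustrative assumption: the linear cooling schedule, the step count, the single-node move and the toy graph are hypothetical, and the sketch performs one run rather than iterating over several cluster limits and initial states.

```python
import math
import random


def shift(weights, threshold=2.5):
    """Shift median judgments (1-4) so that weights >= 0 suggest 'same sense'."""
    return {edge: w - threshold for edge, w in weights.items()}


def loss(shifted, label):
    """Positive weights across clusters plus |negative| weights within clusters."""
    total = 0.0
    for (a, b), w in shifted.items():
        same_cluster = label[a] == label[b]
        if w >= 0 and not same_cluster:
            total += w
        elif w < 0 and same_cluster:
            total += -w
    return total


def anneal(shifted, nodes, max_clusters, steps=2000, t0=1.0, seed=0):
    """Simulated annealing over cluster labels; returns the best state seen."""
    rng = random.Random(seed)
    label = {node: 0 for node in nodes}  # start with a single cluster
    current = loss(shifted, label)
    best, best_loss = dict(label), current
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # assumed linear cooling
        node = rng.choice(nodes)
        old = label[node]
        label[node] = rng.randrange(max_clusters)  # propose moving one node
        new = loss(shifted, label)
        if new <= current or rng.random() < math.exp((current - new) / temp):
            current = new
            if new < best_loss:
                best, best_loss = dict(label), new
        else:
            label[node] = old  # reject the move
    return best, best_loss


# Hypothetical usage graph: edge -> median judgment.
weights = {("u1", "u2"): 4, ("u1", "u3"): 1, ("u2", "u3"): 1.5, ("u3", "u4"): 3.5}
shifted = shift(weights)
nodes = ["u1", "u2", "u3", "u4"]
best_labels, best_loss = anneal(shifted, nodes, max_clusters=3)
```

Unobserved edges simply do not appear in the dictionary and so contribute nothing to the loss, which is how this formulation handles missing information.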

Change scores
A sense frequency distribution (SFD) encodes how often a word w occurs in each of its senses (e.g., McCarthy et al., 2004; Lau et al., 2014). From the clustering we obtain two SFDs D, E for a word w in the two corpora C1, C2, where each cluster corresponds to one sense. 12 Binary LSC for Subtask 1 of the word w is then defined as
$$ B(w) = \begin{cases} 1 & \text{if for some } i,\ D_i \le k \text{ and } E_i \ge n, \text{ or vice versa,} \\ 0 & \text{otherwise,} \end{cases} \quad (2) $$
where D_i and E_i are the frequencies of sense i in C1, C2 and k, n are lower frequency thresholds intended to prevent small random fluctuations in sense frequencies, caused by sampling variability or annotation error, from being misclassified as change. According to Definition 2, a word is classified as gaining a sense if the sense is attested at most k times in the annotation sample from C1, but attested at least n times in the sample from C2. (Similarly for words that lose a sense.) We set k = 0, n = 1 for the smaller samples (≤ 30) in Latin and k = 2, n = 5 for the larger samples (≤ 100) in English, German, Swedish. We make no distinction between words that gain vs. words that lose senses; both fall into the change class. Equally, we make no distinction between words that gain/lose one sense vs. words that gain/lose several senses.
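As an illustration of the binary change notion described above, the following minimal sketch applies the thresholds k and n to the sense frequencies of cell from Table 1. The function name is hypothetical, and the thresholds are those reported for the larger samples.

```python
def binary_change(d, e, k, n):
    """Return 1 if some sense is attested <= k times in one period
    and >= n times in the other period, else 0."""
    for d_i, e_i in zip(d, e):
        gained = d_i <= k and e_i >= n
        lost = e_i <= k and d_i >= n
        if gained or lost:
            return 1
    return 0


# Sense frequencies of "cell" from Table 1, in the order chamber, biology, phone.
D = [12, 18, 0]
E = [4, 11, 18]
cell_label = binary_change(D, E, k=2, n=5)
```

Here the sense 'phone' is the only one that triggers the change label; the drop of 'chamber' from 12 to 4 uses does not, since 4 lies above k.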
For graded LSC in Subtask 2 we first normalize D and E to probability distributions P and Q by dividing each element by the total sum of the frequencies of all senses in the respective distribution. The degree of LSC of the word w is then defined as the Jensen-Shannon distance between the two normalized frequency distributions:
$$ G(w) = JSD(P, Q) \quad (3) $$
where the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence, a symmetrized version of the Kullback-Leibler divergence (Lin, 1991; Donoso and Sanchez, 2017). G(w) is symmetric, ranges between 0 and 1 and is high if P and Q assign very different probabilities to the same senses. Note that B(w) and G(w) do not necessarily correspond to each other: a word w may show no binary change but high graded change, or vice versa.
Figure 1 and Figure 2 show the annotated and clustered usage graphs for Swedish target ledning and German target Eintagsfliege. Nodes represent uses of the target word. Edges represent the median of relatedness judgments between uses (black/gray lines for positive/negative edges). Colors mark clusters (senses) inferred on the full graph. After splitting the full graph into the two time-specific subgraphs for C1, C2 we obtain the two sense frequency distributions D1, D2. From these we inferred the binary and the graded change value. The two words represent semantic changes indicative of Subtask 1 and 2 respectively: ledning gains a sense with rather low frequency in C2. Hence, it has binary change, but low graded change. For Eintagsfliege, however, its two main senses exist in both C1 and C2, while their frequencies change dramatically. Hence, it has no binary change, but high graded change.
A summary of the annotation outcome for all languages and target words is given in Table 4. The final test sets contain between 31 (Swedish) and 48 (German) target words. Throughout the annotation we excluded several targets if they had a high number of '0' judgments or needed a high number of further edges to be annotated.
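The graded change score defined earlier in this section can be sketched in a few lines. The sketch assumes a base-2 logarithm, which is what bounds the distance by 1, and non-empty frequency vectors over a shared list of senses; the function names are hypothetical.

```python
import math


def normalize(freqs):
    """Turn sense frequencies into a probability distribution."""
    total = sum(freqs)
    return [f / total for f in freqs]


def kl(p, m):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log2(p_i / m_i) for p_i, m_i in zip(p, m) if p_i > 0)


def graded_change(d, e):
    """Jensen-Shannon distance between two sense frequency distributions."""
    p, q = normalize(d), normalize(e)
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]  # mixture distribution
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Sense frequencies of "cell" from Table 1, in the order chamber, biology, phone.
cell_score = graded_change([12, 18, 0], [4, 11, 18])
```

Two distributions with identical proportions give 0 and two distributions with no shared sense give 1, so every other case, including the cell example, falls strictly in between.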
As in previous studies, we report the mean of Spearman correlations between annotator judgments as agreement measure. Erk et al. (2013) and Schlechtweg et al. (2018) report agreement scores between 0.55 and 0.68, which is comparable to our scores. 13 The clustering loss is the value of L (Definition 1) divided by the maximum possible loss on the respective graph. It gives a measure of how well the graphs could be partitioned into clusters by the L criterion.

Result
The class distribution (column 'LSC') for Subtask 1 differs per language as a result of several target words being dropped during the annotation. In Latin the majority of target words have binary change, while in Swedish the majority have no binary change. This is also reflected in the mean scores for graded LSC in Subtask 2. Despite the excluded target words the frequency statistics are roughly balanced (FRQ_d, FRQ_m). However, we did not control the test sets for polysemy and there are strong correlations for English, German and Swedish between graded change and polysemy in Subtask 2 (PLY_m). This correlation is weaker for binary change in Subtask 1 but is still moderate for English and Swedish and remains high for German.
In total, roughly 100,000 judgments were made by annotators. For English/German/Swedish ≈ 50% of the use pairs were annotated by more than one annotator. In total, the annotation cost roughly € 20,000 for 1,000 hours, twice as much as originally budgeted.

Evaluation
All teams were allowed a total of 10 submissions, the best of which was kept for the final ranking in the competition. Participants had to submit predictions for both subtasks and all languages. A submission's final score for each subtask was computed as the average performance across all four languages. During the evaluation phase, the leaderboard was hidden, as per SemEval recommendation.

Scoring Measures
For Subtask 1 submitted predictions were evaluated against the hidden labels via accuracy, given that we anticipated the class distribution for target words to be approximately balanced before the annotation. Scores are bounded between 0 and 1. As the distribution turned out to be imbalanced for some languages, we also report F1-score in Appendix C. For Subtask 2, we used Spearman's rank-order correlation coefficient ρ with the gold rank. Spearman's ρ only considers the order of the words; the actual predicted change values are not taken into account. Ties are corrected by assigning to each tied value the average of the ranks that would have been assigned to all the tied values (e.g. two words sharing rank 1 both get assigned rank 1.5). Scores are bounded between −1 (completely opposite to true ranking) and 1 (exact match).
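Both scoring measures can be sketched with the standard library alone. This is not the task's official scorer; the function names are hypothetical, and Spearman's ρ is computed here as the Pearson correlation of tie-averaged ranks, which assumes that neither ranking is constant.

```python
def accuracy(gold, pred):
    """Share of target words whose predicted binary label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_ranks(values):
    """Rank values (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(gold, pred):
    """Pearson correlation of the tie-averaged ranks of gold and predicted scores."""
    rg, rp = average_ranks(gold), average_ranks(pred)
    mg, mp = sum(rg) / len(rg), sum(rp) / len(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_g * var_p) ** 0.5
```

For instance, `average_ranks([0.3, 0.3, 0.9])` assigns rank 1.5 to both tied values, matching the tie correction described above.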

Baselines
For both subtasks, we have two baselines: (i) Normalized frequency difference (Freq. Baseline) first calculates the frequency for each target word in each of the two corpora, normalizes it by the total corpus frequency and then calculates the absolute difference between these values as a measure of change.
(ii) Count vectors with column intersection and cosine distance (Count Baseline) first learns vector representations for each of the two corpora, then aligns them by intersecting their columns and measures change by cosine distance between the two vectors for a target word. A Python implementation of both these baselines was provided in the starting kit. A third baseline, for Subtask 1, is the majority class prediction (Maj. Baseline), i.e., always predicting the '0' class (no change).
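The two corpus-based baselines can be sketched as follows. This is not the starting-kit implementation: the window size, the tokenized toy corpora and all function names are assumptions, and the count vectors are built only for the target word rather than for the full vocabulary.

```python
import math
from collections import Counter


def freq_baseline(corpus1, corpus2, target):
    """Absolute difference of the target's corpus-normalized frequencies."""
    def rel_freq(corpus):
        tokens = [tok for sent in corpus for tok in sent]
        return tokens.count(target) / len(tokens)
    return abs(rel_freq(corpus1) - rel_freq(corpus2))


def count_vector(corpus, target, window=2):
    """Co-occurrence counts of the target with words in a symmetric window."""
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == target:
                vec.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])
    return vec


def count_baseline(corpus1, corpus2, target, window=2):
    """Cosine distance between count vectors restricted to shared columns."""
    v1 = count_vector(corpus1, target, window)
    v2 = count_vector(corpus2, target, window)
    vocab1 = {tok for sent in corpus1 for tok in sent}
    vocab2 = {tok for sent in corpus2 for tok in sent}
    columns = sorted(vocab1 & vocab2)  # column intersection as alignment
    a = [v1[w] for w in columns]       # Counter yields 0 for unseen contexts
    b = [v2[w] for w in columns]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical lemmatized toy corpora (lists of tokenized sentences).
c1 = [["cell", "wall", "prison"], ["monk", "cell", "wall"]]
c2 = [["cell", "phone", "wall"], ["monk", "call", "phone", "prison"]]
freq_score = freq_baseline(c1, c2, "cell")
count_score = count_baseline(c1, c2, "cell")
```

Restricting both vectors to the context words that occur in both corpora makes them directly comparable without learning any mapping between the two spaces.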

Participating Systems
Thirty-three teams participated in the task, totaling 53 members. The teams submitted a total of 186 submissions. Given the large number of teams, we provide a summary of the systems in the body of this paper. A more detailed description of each participating system for which a paper was submitted is available in Appendix B. We also encourage the reader to read the full system description papers.
Participating models can be described as a combination of (i) a semantic representation, (ii) an alignment technique and (iii) a change measure. Semantic representations are mainly average embeddings (type embeddings) and contextualized embeddings (token embeddings). Token embeddings are often combined with a clustering algorithm such as K-means, affinity propagation (Frey and Dueck, 2007), (H)DBSCAN, GMM, or agglomerative clustering. One team uses a graph-based semantic network, one a topic model and several teams also propose ensemble models. Alignment techniques include Orthogonal Procrustes. Table 5 shows the type of system for every team's best submission for both subtasks.

Results
As illustrated by Table 5, UWB has the best performance in Subtask 1 for the average over all languages, closely followed by Life-Language, Jiaxin & Jinan 14 and RPI-Trust. 15 For Subtask 2, UG Student Intern performs best, followed by Jiaxin & Jinan and cs2020. 16 Across all systems, good performance in Subtask 1 does not indicate good performance in Subtask 2 (correlation between the system ranks is 0.22). However, with the exception of Life-Language and cs2020, most top performing systems in Subtask 1 also excel in Subtask 2, albeit with a slight change of ranking.
Table 5: Summary of the performance of systems for which a system description paper was submitted, as well as their type of semantic representation for that specific submission in Subtask 1 (left) and Subtask 2 (right). For each team, we report the values of accuracy (Subtask 1) and Spearman correlation (Subtask 2) corresponding to their best submission in the evaluation phase. Abbreviations: Avg. = average across languages, EN = English, DE = German, LA = Latin, and SV = Swedish, type = average embeddings, token = contextualised embeddings, topic = topic model, ens. = ensemble, graph = graph, UCD = University College Dublin.
Remarkably, all the top performing systems use static type embedding models, and differ only in terms of their solutions to the alignment problem (Canonical Correlation Analysis, Orthogonal Procrustes, or Temporal Referencing). Interestingly, the top systems refine their models using one or more of the following steps: a) computing additional features from the embedding space; b) combining scores from different models (or extracted features) using ensemble models; c) choosing a threshold for changed words based on a distribution of change scores. We conjecture that these additional (and sometimes very original) post-processing steps are crucial for these systems' success. We now briefly describe the top performing systems in terms of these three steps (for further details please see Appendix B), abbreviating Skip-Gram with Negative Sampling as SGNS, Canonical Correlation Analysis as CCA, Orthogonal Procrustes as OP, Temporal Referencing as TR and cosine distance as CD. UWB (SGNS+CCA+CD) sets the average change score as the threshold (c). Life-Language (SGNS) represents words according to their distances to a set of stable pivot words in two unaligned spaces, and compares their divergence relative to a distribution of change scores obtained from unstable pivot words (a+c). RPI-Trust (SGNS+OP) extracts features (a word's cosine distance, change of distances to its nearest neighbours and change in frequency), transforms each of a word's features to a CDF score, and averages these probabilities (a+b+c). Jiaxin & Jinan (SGNS+TR+CD) fits the empirical cosine distance change scores to a Gamma Quantile Threshold, and sets the 75% quantile as the threshold (c). UG Student Intern (SGNS+OP) measures change using Euclidean distance instead of cosine distance. cs2020 uses SGNS+OP+CD only as a baseline method.
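Step (c), choosing a threshold from the distribution of change scores, can be illustrated for the simplest variant mentioned above, where the average change score serves as the threshold. This is a hypothetical sketch rather than any team's code: the function name, the toy scores and the choice to label only scores strictly above the mean as changed are assumptions.

```python
from statistics import mean


def binarize_by_mean(scores):
    """Label a word as changed (1) if its graded score exceeds the mean score."""
    threshold = mean(scores.values())
    return {word: int(score > threshold) for word, score in scores.items()}


# Hypothetical graded change scores (e.g. cosine distances) for three target words.
scores = {"word_a": 0.1, "word_b": 0.2, "word_c": 0.9}
labels = binarize_by_mean(scores)
```

A threshold derived from the score distribution needs no labeled data, which fits the unsupervised setting of the task.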
An important finding common to most systems is the difference between their performances across the four languages: systems that excel in one language do not necessarily perform well in another. This discrepancy may be due to a range of factors, including the difference in corpus size and the nature of the corpus data, as well as the relative availability of resources in some languages such as English over others. The Latin corpus, for example, covers a very long time span, and the lower performance of the systems on this language may be explained by the fact that the techniques employed, especially word token/type embeddings, have been developed for living languages and little research is available on their adaptation to dead and ancient languages. In general, dead languages tend to pose additional challenges compared to living languages (Piotrowski, 2012), due to a variety of factors, including their less-resourced status, lack of native speakers, high linguistic variation and non-standardized spelling, and errors in Optical Character Recognition (OCR). Other factors that should be investigated are related to data quality (Hill and Hengchen, 2019;van Strien et al., 2020): while the English and Latin data are clean, the German and Swedish data contain notorious OCR errors. The availability of tuned hyperparameters might have played a role as well: for German, some teams report following prior work. Finally, another factor for the discrepancy in performance between languages for any given system is related neither to the nature of the systems nor to the data, but to the fact that some teams focused on some languages and submitted dummy results for the others.

Type versus token embeddings
Tables 5 and 6 illustrate the gap in performance between type-based embedding models and token-based ones. Out of the best 10 systems in Subtask 1/Subtask 2, 7/8 systems are based on type embeddings compared to only 2/2 systems that are based on token embeddings (the same holds for each language individually). Contrary to the recent success of token embeddings and to the commonly held view that contextual embeddings "do everything better", they are overwhelmingly outperformed by type embeddings in our task. This is most surprising for Subtask 1, because type embeddings do not distinguish between different senses, while token embeddings do. We suggest several possible reasons for these surprising results. The first is the fact that contextual embedding is a recent technology, and as such lacks proper usage conventions. For example, it is not clear whether a model should create an average token representation based on individual instances (and if so, which layers should be averaged), or whether it should use clustering of individual instances instead (and if so, what type of clustering algorithm etc.). A second reason may be related to the fact that contextual models are pretrained and cannot exclusively be trained on the relevant historical resources (in contrast to type embeddings). As such, they carry additional, and possibly irrelevant, information that may mask true diachronic changes. The results may also be related to the specific preprocessing we applied to the corpora: (i) Only restricted context is available to the models as a result of the sentence shuffling. Usually, token-based models take more context into account than just the immediate sentence (Martinc et al., 2020). (ii) The corpora were lemmatized, while token-based models usually take the raw sentence as input. In order to make the input more suitable for token-based models, we also provide the raw corpora after the evaluation phase and will publish the annotated uses of the target words with additional context. 17
Table 6: Average and maximum performance of best submissions per subtask for different system types. Submissions that corresponded exactly to the baselines or the sample submission were removed.
The influence of frequency. In prior work, the predictions of many systems have been shown to be inherently biased towards word frequency, either as a consequence of an increasing sampling error with lower frequency (Dubossarsky et al., 2017) or by directly relying on frequency-related variables (Schlechtweg et al., 2017). We have controlled for frequency when selecting target words (recall Table 4) in order to test model performance when frequency is not an indicating factor. Despite the controlled test sets, we observe strong frequency biases for the individual models, as illustrated for Swedish in Figure 3. 18 Models tend to correlate negatively with the minimum frequency of target words between corpora (FRQ_m), and positively with the change in their frequency across corpora (FRQ_d). This means that models predict higher change for low-frequency words and higher change for words with strong changes in frequency. Despite their superior performance, type embeddings are more strongly influenced by frequency than token embeddings, probably because the latter are not trained on the test corpora, which limits the influence of frequency. Similar tendencies can be seen for the other languages. For a range of models, correlations reach values > 0.8.
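A frequency-bias check of this kind can be sketched as a rank correlation between a model's predictions and the two frequency variables. The sketch below makes several assumptions: it uses Spearman's rank correlation, takes the minimum frequency as the smaller of the two corpus frequencies, and takes the frequency change as the absolute difference between them. The exact definitions behind Table 4 may differ, and all numbers are hypothetical toy values.

```python
from statistics import mean

def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: word frequencies in both corpora and the
# change scores predicted by one model for the same four words.
freq_c1 = [100, 500, 50, 1000]
freq_c2 = [120, 200, 400, 1100]
predictions = [0.3, 0.4, 0.9, 0.1]

frq_m = [min(a, b) for a, b in zip(freq_c1, freq_c2)]   # minimum frequency
frq_d = [abs(a - b) for a, b in zip(freq_c1, freq_c2)]  # assumed definition

rho_m = spearman(predictions, frq_m)
rho_d = spearman(predictions, frq_d)
print(rho_m, rho_d)
```

On these toy values the predictions correlate negatively with the minimum frequency and positively with the frequency change, which is the pattern described above.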
The influence of polysemy. We did not control the test sets for polysemy. As shown in Table 4, the change scores for both subtasks are moderately to highly correlated with polysemy (PLY_m). Hence, model predictions are expected to be positively correlated with polysemy as well. However, these correlations are in almost all cases lower than those for the change scores, and in some cases even negative (Latin and partly English). We conclude that model predictions are only moderately biased towards polysemy on our data.
Prediction difficulty of words. In order to quantify how difficult a target word is to predict, we compute the mean error of all participants' predictions. 19 In Subtask 1, we find that words with higher rank tend to have higher error, in particular for English and Latin; see Figure 4 (left), where English words with gold class 1 have an average error almost twice as high as words with gold class 0. This is likely due to the tendency of systems to provide zero-predictions following the published baselines. For Subtask 2 (right), we find that the opposite holds: stable words are harder to predict for all languages but Swedish, where the words in the middle of the ranking instead seem to be the hardest to predict. For English, the three hardest-to-predict words for Subtask 1 vs. Subtask 2 are land, head, edge vs. word, head, multitude. For German, they are packen, überspannen, abgebrüht vs. packen, Seminar, vorliegen. For Latin, they are cohors, credo, virtus vs. virtus, fidelis, itero. For Swedish, they are kemisk, central, bearbeta vs. central, färg, blockera. We could not identify a general pattern with regard to these words' frequency or polysemy properties.
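As a rough illustration of such a difficulty measure, the sketch below averages the absolute error of every submission for each target word in a binary setting like Subtask 1. The exact error definition used for the analysis is not restated here, so the absolute error is an assumption, and the words, gold labels and submissions are hypothetical placeholders.

```python
from statistics import mean

def word_difficulty(gold, submissions):
    """Mean absolute error per target word, averaged over all submissions."""
    return {
        word: mean(abs(sub[word] - gold[word]) for sub in submissions)
        for word in gold
    }

# Hypothetical gold classes (1 = changed, 0 = stable) for placeholder words.
gold = {"w1": 1, "w2": 1, "w3": 0}

# Hypothetical binary predictions from three systems.
submissions = [
    {"w1": 0, "w2": 0, "w3": 0},
    {"w1": 0, "w2": 1, "w3": 0},
    {"w1": 0, "w2": 0, "w3": 1},
]

difficulty = word_difficulty(gold, submissions)
hardest_first = sorted(difficulty, key=difficulty.get, reverse=True)
print(hardest_first)
```

Systems that mostly predict 0 make changed words look harder than stable ones, which mirrors the zero-prediction effect discussed above.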

Conclusion
We presented the results of the first shared task on Unsupervised Lexical Semantic Change Detection. A wide range of systems was evaluated on two subtasks in four languages, relying on a thoroughly annotated data set based on ∼100,000 human judgments. The task setup (unsupervised, no genuine development data, different corpora from different languages with very different sizes, varying class distributions) provided an opportunity to test models in heterogeneous learning scenarios, which proved very challenging. Hence, both subtasks remain far from solved. However, several teams reached high performance on both subtasks. Surprisingly, type embeddings outperformed token embeddings on both subtasks. We suspect that the potential of token embeddings has not yet fully unfolded, as no canonical application concept is available and preprocessing was not optimal for token embeddings. We found that type embeddings are strongly influenced by frequency. Hence, one important challenge for future type-based models will be to avoid the frequency bias stemming from the corpus on which they are trained. An important challenge for token-based models will be to understand the reasons for their current low performance and to develop robust ways of applying them. We found that change scores in our test sets strongly correlate with polysemy, although model predictions do not show such a strong influence. We believe that this should be pursued in the future by controlling test sets for polysemy. We hope that SemEval-2020 Task 1 makes a lasting contribution to the field of Unsupervised Lexical Semantic Change Detection by providing researchers with a standard evaluation framework and high-quality data sets. Despite the limited size of the test sets, many previously reached conclusions can now be tested more thoroughly and future models can be compared on a shared benchmark.
The current test set can also be used to test models that have been trained on the full data available for the participating corpora. Data from additional time periods can be utilized by models that need finer granularity for detection, while testing on the two time periods available in the current test sets.

A.1 Edge sampling
Retrieving the full usage graph is not feasible even for a small set of n uses, as this implies annotating n(n − 1)/2 edges. Hence, the main challenge with our annotation approach was to reduce the number of edges to annotate as far as possible, while keeping the information needed to infer a meaningful clustering on the graph. We did this by annotating the data in several rounds. After each round the usage graph of a target word was updated with the new annotations and a new clustering was obtained. 20 Based on this clustering we sampled the edges for the next round, applying simple heuristics similar to Biemann (2013). We spread the annotation load randomly over annotators, making sure that roughly half of the use pairs were annotated by more than one annotator.
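The quadratic growth of the full usage graph is easy to see with a few values of n. The short sketch below only evaluates the n(n − 1)/2 formula; the chosen values of n are arbitrary illustrations.

```python
def full_graph_edges(n):
    """Number of use pairs (edges) in a complete usage graph with n uses."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, full_graph_edges(n))
```

For 100 uses of a single target word this already amounts to 4,950 use pairs.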
In the first round we aimed to obtain a small but good reference set of uses against which the remaining uses could be compared in the second round. Hence, we sampled 10% of the uses from U and 30% of the edges from this sample by exploration, i.e., by a random walk through the sample graph guaranteeing that all nodes are connected by some path. The first clustering was thus obtained on a small but richly connected subgraph, which guaranteed that we did not infer a larger number of clusters than present in the data in the first round; inferring too many clusters would have led to a strong increase in annotation instances in the subsequent rounds. In all subsequent rounds we paired a combination step with an exploration step. A multi-cluster is a cluster with ≥ 2 uses. The combination step combined each single use u_1 that was not yet a member of a multi-cluster with a random use u_2 from each of the multi-clusters to which u_1 had not yet been compared. The exploration step consisted of a random walk on 30% of the edges from the non-assignable uses, i.e., uses which had already been compared to each of the multi-clusters but were not assigned to any of these by the clustering algorithm. This procedure slowly populated the graph while minimizing the annotation of redundant information. The procedure stopped when each cluster had been compared to each other cluster. We validated the procedure in a simulation study (see below).
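The first-round exploration can be sketched as a random walk over the complete graph on the sampled uses. The sketch assumes that "30% of the edges" refers to the possible edges among the sampled uses and that the walk continues until every sampled use has been visited, so it may collect slightly more edges than the target. The function names, the minimum sample size of two and the toy data are hypothetical, and this is not the sampling code used for the annotation.

```python
import random

def sample_uses(uses, fraction=0.1, rng=random):
    """Draw a random subset of uses (at least two, so that edges exist)."""
    k = max(2, round(len(uses) * fraction))
    return rng.sample(uses, k)

def random_walk_edges(nodes, edge_fraction=0.3, rng=random):
    """Collect distinct edges along a random walk on the complete graph.

    The walk stops once every node has been visited and at least
    `edge_fraction` of all possible edges has been collected. Since all
    edges lie on one walk, the sampled subgraph is connected.
    """
    n = len(nodes)
    target = round(edge_fraction * n * (n - 1) / 2)
    current = rng.choice(nodes)
    visited = {current}
    edges = set()
    while len(visited) < n or len(edges) < target:
        nxt = rng.choice([u for u in nodes if u != current])
        edges.add(tuple(sorted((current, nxt))))
        visited.add(nxt)
        current = nxt
    return edges

rng = random.Random(0)                # fixed seed for reproducibility
uses = list(range(200))               # hypothetical use identifiers
sample = sample_uses(uses, rng=rng)   # 10% of the uses
edges = random_walk_edges(sample, rng=rng)
print(len(sample), len(edges))
```

Drawing all edges from a single walk is one simple way to guarantee that every sampled use is connected to every other by some path.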
We combined the above procedure with further heuristics added after round 1: (i) we sampled a low number of randomly chosen edges and edges between already confirmed multi-clusters for further annotation to corroborate the inferred structure; (ii) we detected relevant disagreements between annotators, i.e., judgments with a difference of ≥ 2 on the scale or edges with a median ≈ 2.5, and redistributed the corresponding edges to another annotator to resolve the disagreements; and (iii) we detected clustering conflicts, i.e., positive edges between clusters and negative edges within clusters (see below) and sampled a new edge for each node connected by a conflicting edge. This added more information in regions of the graph where finding a good clustering was hard. Furthermore, after each round, we removed nodes from the graph whose 0-judgments (undecidable) made up more than half of their total judgments. We stopped the annotation after four rounds.
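Heuristic (ii) and the removal of undecidable nodes can be sketched as follows. The sketch assumes that 0-judgments are ignored when disagreements are computed and treats "median ≈ 2.5" as equality within a small tolerance. Both choices, the function names and the toy judgments are assumptions made for illustration.

```python
from statistics import median

def is_disagreement(judgments, tol=1e-9):
    """Flag an edge whose ratings differ by >= 2 or have a median of 2.5."""
    rated = [j for j in judgments if j != 0]  # assumed: skip undecidable (0)
    if len(rated) < 2:
        return False
    return max(rated) - min(rated) >= 2 or abs(median(rated) - 2.5) <= tol

def undecidable_nodes(edge_judgments):
    """Return nodes for which more than half of all judgments are 0."""
    total, zeros = {}, {}
    for (u, v), ratings in edge_judgments.items():
        for node in (u, v):
            total[node] = total.get(node, 0) + len(ratings)
            zeros[node] = zeros.get(node, 0) + ratings.count(0)
    return {node for node in total if zeros[node] > total[node] / 2}

# Hypothetical judgments per use pair on a 1-4 scale (0 = undecidable).
edge_judgments = {
    ("u1", "u2"): [4, 2],
    ("u1", "u3"): [3, 2],
    ("u2", "u3"): [4, 4],
    ("u4", "u1"): [0, 0],
    ("u4", "u2"): [0],
    ("u4", "u3"): [3],
}

to_reannotate = {e for e, js in edge_judgments.items() if is_disagreement(js)}
to_remove = undecidable_nodes(edge_judgments)
print(sorted(to_reannotate), to_remove)
```

In the toy data the first two edges would be passed to another annotator, and the use u4 would be dropped from the graph because three of its four judgments are undecidable.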

A.2 Example
Figure 5 shows an example of our annotation pipeline. As the annotation proceeds through the rounds, the graph becomes more populated and the true cluster structure is found. In round 1, one multi-cluster is found. Hence, all remaining uses are compared with this cluster in round 2 by the combination step. In rounds 3 and 4, the exploration step discovers further clusters that were not found in the earlier rounds.

A.3 Simulation
We validated the annotation procedure and the clustering algorithm described in Section 4 in a simulation study by simulating 40 ground truth usage graphs with Zipfian sense frequency distributions covering roughly the frequency range of the majority of our target words (50-1000). We introduced change to half of the target words by setting some of their senses' frequencies to 0 in either of D_1, D_2. We then sampled from these graphs in several rounds as described above, simulated an annotation in each round with a normally distributed error added to judgments, and compared the resulting clustering to the clustering of the true graph. The true clustering could be recovered with high accuracy (mean adjusted Rand index > .96). We also used the simulation to predict the feasibility of the study and to tune parameters of the annotation such as sample sizes for nodes and edges. With the final parameters described in Section 4.1, the algorithm converged on average after 5 rounds and ≈ 8000 judgments per annotator. This was within the bounds of our time limits and financial budget. We also tested the clustering algorithm against several standard techniques (Biemann, 2006;Blondel et al., 2008) and varied the optimization algorithm for L. None of these variations performed comparably to our approach.

B Systems description
cbk (Beck, 2020) The team obtains contextual embeddings using BERT, and extracts for every target word usage a word embedding using bert-as-service (Xiao, 2018). The team uses the difference of the mean value of all cosine distances (Salton and McGill, 1983) of a target word between two corpora to detect change.
cs2020 21 (Arefyev and Zhikov, 2020) The team submits systems of two types: SGNS with an Orthogonal Procrustes alignment and cosine distance as a change measure, and a variation of a word-sense induction method by Amrami and Goldberg (2018). For the latter, the team replaces BERT by a fine-tuned version of XLM-R and for every target word generates lexical substitutes following Amrami and Goldberg (2019), the vectors of the most probable of which are then clustered using agglomerative clustering with cosine distance.
Table 7: Summary of the precision (P), recall (R), and F1 scores on Subtask 1 for the baseline systems and the systems which submitted a system description paper. 'Avg.' refers to the average across all languages for each system. The baseline systems and the submitting systems are ordered by decreasing F1 of their best submission calculated on the average over all languages.
Discovery Team (Martinc et al., 2020) The team uses two types of word representations: average embeddings from SGNS (Mikolov et al., 2013b) with an Orthogonal Procrustes alignment and contextual embeddings using language-specific BERT. For SGNS+OP the team compares vectors using cosine (Salton and McGill, 1983), while contextual embeddings see two different strategies: averaging of target-word embeddings, and clustering using k-means and affinity