UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection

We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.


Introduction
Words change their meaning over time in all natural languages. The emergence of large representative historical corpora and powerful data-driven semantic models has allowed researchers to track meaning change. SemEval-2020 Shared Task 1 challenges its participants to classify a list of target words as stable or changed (Subtask 1) and/or to rank these words by the degree of their semantic change (Subtask 2) (Schlechtweg et al., 2020). The task is multilingual: it includes four lists of target words, for English, German, Latin, and Swedish respectively (several dozen words each). Each word list is accompanied by two historical corpora of varying size, consisting of texts created in two different time periods. The shared task organisers additionally provided frequency-based and distributional baseline methods.
We participated in Subtask 2 as the UiO-UvA team and investigated the potential of contextualised embeddings for the detection of lexical semantic change. Our systems are based on the ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) language models, and employ 3 different algorithms to compare contextualised embeddings diachronically. Our evaluation phase submission to the shared task ranked 9th out of 34 participating teams, while our post-evaluation phase submission is the best among those published on the shared task website (although some knowledge of the test set statistics was needed; see below). In this paper, we extensively evaluate all combinations of architectures, training corpora and change detection algorithms, using 5 test sets in 4 languages.
Our main findings are twofold: 1) on 3 out of 5 test sets, ELMo consistently outperforms BERT, while being much faster in training and inference; 2) cosine similarity of averaged contextualised embeddings and average pairwise distance between contextualised embeddings are the two best-performing change detection algorithms, but different test sets show a strong preference for either the former or the latter. This preference correlates strongly with the distribution of gold scores in a test set. While this may indicate a bias in the available test sets, the finding remains unexplained as yet.
Our implementations of all the evaluated algorithms are available at https://github.com/akutuzov/semeval2020, and the ELMo models we trained can be downloaded from the NLPL vector repository at http://vectors.nlpl.eu/repository/ (Fares et al., 2017).

Background
The lexical semantic change detection (LSCD) task consists in determining whether and/or to what extent the meaning of a set of target words has changed over time, with the help of time-annotated corpora (Kutuzov et al., 2018; Tahmasebi et al., 2018). LSCD is often addressed using distributional semantic models: time-sensitive word representations (so-called 'diachronic word embeddings') are learned and then compared between different time periods. This is a fast-developing NLP sub-field with a number of influential papers ((Gulordava and Baroni, 2011; Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016; Bamler and Mandt, 2017; Del Tredici et al., 2019; Dubossarsky et al., 2019), and many others).
Most of the previous work used some variation of 'static' word embeddings, where each occurrence of a word form is assigned the same vector representation independently of its context. Recent contextualised architectures overcome this limitation by taking sentential context into account when inferring word token representations. However, the application of such architectures to diachronic semantic change detection has so far been limited (Hu et al., 2019; Giulianelli et al., 2020; Martinc et al., 2020). While all these studies use BERT as their contextualising architecture, we extend our analysis to ELMo and perform a systematic evaluation of various semantic change detection approaches for both language models.
ELMo (Peters et al., 2018) was arguably the first contextualised word embedding architecture to attract wide attention in the NLP community. The network architecture consists of a two-layer bidirectional LSTM on top of a convolutional layer. BERT (Devlin et al., 2019) is a Transformer with self-attention trained on masked language modelling and next sentence prediction. While Transformer architectures have been shown to outperform recurrent ones in many NLP tasks, ELMo allows faster training and inference than BERT, making it more convenient to experiment with different training corpora and hyperparameters.

System overview
Our approach relies on contextualised representations of word occurrences, often referred to as contextualised embeddings. Given two time periods t_1, t_2, two corpora C_1, C_2, and a set of target words, we use a neural language model to obtain contextualised embeddings of each occurrence of the target words in C_1 and C_2 and use them to compute a continuous change score. This score indicates the degree of semantic change undergone by a word between t_1 and t_2, and the target words are ranked by its value.
More precisely, given a target word w and its sentential context s = (v_1, ..., v_i, ..., v_m) with w = v_i, we extract the activations of the language model's hidden layers at sentence position i. The N_w contextualised embeddings collected for w can be represented as the usage matrix U_w = (x_1, ..., x_{N_w}). The time-specific usage matrices U_w^{t_1}, U_w^{t_2} for time periods t_1 and t_2 are used as input to all the tested metrics of semantic change. We use three change detection algorithms:

1. Inverted cosine similarity over word prototypes (PRT). Given two usage matrices U_w^{t_1}, U_w^{t_2}, the degree of change of w is calculated as the inverted similarity between the average token embeddings ('prototypes') of all occurrences of w in the two time periods:

PRT(U_w^{t_1}, U_w^{t_2}) = \frac{1}{d\left(\frac{1}{N_w^{t_1}} \sum_{x_i \in U_w^{t_1}} x_i, \; \frac{1}{N_w^{t_2}} \sum_{x_j \in U_w^{t_2}} x_j\right)}

where N_w^{t_1} and N_w^{t_2} are the number of occurrences of w in time periods t_1 and t_2, and d is a similarity metric, for which we use cosine similarity. This method is similar to the standard LSCD workflow with static embeddings produced by Procrustes-aligned time-specific distributional models (Hamilton et al., 2016), with the only additional step of averaging token embeddings to create a single vector. Since we want the algorithm to produce higher scores for the words which changed more, the inverted value of cosine similarity is used as the prediction.
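The PRT measure can be sketched as follows, assuming each usage matrix is a NumPy array with one row per token occurrence (the function name is ours; we use the reciprocal of the cosine similarity as the inversion, though negation would serve the same ranking purpose):

```python
import numpy as np

def prt_change_score(usage_matrix_t1, usage_matrix_t2):
    """PRT: inverted cosine similarity between the averaged token
    embeddings ('prototypes') of a word in two time periods.

    Each usage matrix has shape (n_occurrences, embedding_dim).
    Higher scores indicate stronger semantic change.
    """
    proto_t1 = usage_matrix_t1.mean(axis=0)  # prototype for period t1
    proto_t2 = usage_matrix_t2.mean(axis=0)  # prototype for period t2
    cos_sim = np.dot(proto_t1, proto_t2) / (
        np.linalg.norm(proto_t1) * np.linalg.norm(proto_t2))
    return 1.0 / cos_sim
```

Identical prototypes yield a score of 1.0; the score grows as the two prototypes drift apart.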
2. Average pairwise cosine distance between token embeddings (APD). Here, the degree of change of w is measured as the average distance between any two embeddings from different time periods (Giulianelli et al., 2020):

APD(U_w^{t_1}, U_w^{t_2}) = \frac{1}{N_w^{t_1} \cdot N_w^{t_2}} \sum_{x_i \in U_w^{t_1}, \, x_j \in U_w^{t_2}} d(x_i, x_j)

where d is the cosine distance. High APD values indicate stronger semantic change.
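A vectorised sketch of APD (function name ours; rows of the two NumPy matrices are token embeddings from the respective periods):

```python
import numpy as np

def apd_change_score(U1, U2):
    """APD: average pairwise cosine distance between token embeddings
    drawn from the two time periods (rows of U1 vs rows of U2)."""
    # L2-normalise rows so that the dot product equals cosine similarity.
    U1n = U1 / np.linalg.norm(U1, axis=1, keepdims=True)
    U2n = U2 / np.linalg.norm(U2, axis=1, keepdims=True)
    sims = U1n @ U2n.T                 # (N1, N2) cosine similarities
    return float(np.mean(1.0 - sims))  # cosine distance = 1 - similarity
```

Computing all N1 x N2 distances in one matrix product keeps this tractable even for frequent target words.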
3. Jensen-Shannon divergence (JSD). This measure relies on the partitioning of embeddings into clusters of similar word usages. We follow Giulianelli et al. (2020) and create a single usage matrix with occurrences from the two corpora [U_w^{t_1}; U_w^{t_2}]. We then standardise it and cluster its entries using Affinity Propagation (Frey and Dueck, 2007), which automatically selects a number of clusters for each word (Martinc et al., 2020). Finally, we define probability distributions u_w^{t_1}, u_w^{t_2} based on the normalised counts of word occurrences from each cluster (Hu et al., 2019) and compute a JSD score (Lin, 1991):

JSD(u_w^{t_1}, u_w^{t_2}) = H\left(\frac{1}{2}\left(u_w^{t_1} + u_w^{t_2}\right)\right) - \frac{1}{2}\left(H(u_w^{t_1}) + H(u_w^{t_2})\right)

where H is the entropy. JSD scores measure the amount of change in the proportions of word usage clusters across time periods. A high JSD score indicates a high degree of lexical semantic change.

Experimental setup
For each of the 4 languages of the shared task, we train 4 ELMo model variants: 1) Pre-trained, an ELMo model trained on the respective Wikipedia corpus (English, German, Latin or Swedish); 2) Fine-tuned, the same as Pre-trained but further fine-tuned on the union of the two test corpora; 3) Trained on test, trained only on the union of the two test corpora; 4) Incremental, two models: the first is trained on the first test corpus, and the second is the same model further trained on the second test corpus. The ELMo models are trained for 3 epochs (except the English and Latin Trained on test and Incremental models, for which we use 5 epochs, due to the small test corpora sizes), with an LSTM dimensionality of 2048, a batch size of 192, and 4096 negative samples per batch. All other hyperparameters are left at their default values. For BERT, we use the base version, with 12 layers and 768 hidden dimensions. For English, German and Swedish, we employ language-specific models: bert-base-uncased, bert-base-german-cased, and af-ai-center/bert-base-swedish-uncased. For Latin, we resort to bert-base-multilingual-cased, since there is no specific Latin BERT available yet. Given the limited size of the test corpora (on the order of 10^8 word tokens at most), we do not train BERT from scratch and only test the Pre-trained and Fine-tuned BERT variants. The fine-tuning is done with BERT's standard objective for 2 epochs (5 epochs for English). We configure BERT's WordPiece tokeniser to never split any occurrences of the target words (some target words are split by default into character sequences), and we add unknown target words to BERT's vocabulary. We perform this step both before fine-tuning and before the extraction of contextualised representations.
At inference time, we use all ELMo and BERT variants to produce contextualised representations of all the occurrences of each target word in the test corpora. For the Incremental variant, the representations for the occurrences in each of the two test corpora are produced using the respective model trained on this corpus. The resulting embeddings are of size 12 × 768 and 3 × 512 for BERT and ELMo, respectively. We employ three strategies to reduce their dimensionality to that of a single layer: 1) using only the top layer, 2) averaging all layers, 3) averaging the last four layers (only for BERT embeddings, as this aggregation method was shown to work on par with the all-layers alternative by Devlin et al. (2019)). Finally, to predict the strength of semantic change of each target word between the two test corpora, we feed the word's contextualised embeddings into the three algorithms of semantic change estimation described in Section 3. We then compute the Spearman correlation of the estimated change scores with the gold answers. This is the evaluation metric of Subtask 2, and we use it throughout our experiments.
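The three layer-reduction strategies can be sketched as follows (function and strategy names are ours; the input is a per-token activation array of shape (n_layers, dim), e.g. 12 × 768 for BERT or 3 × 512 for ELMo):

```python
import numpy as np

def reduce_layers(activations, strategy="top"):
    """Collapse per-token hidden layer activations of shape
    (n_layers, dim) into a single embedding of shape (dim,)."""
    if strategy == "top":
        return activations[-1]             # top layer only
    if strategy == "average_all":
        return activations.mean(axis=0)    # mean over all layers
    if strategy == "average_last4":
        return activations[-4:].mean(axis=0)  # BERT-only in our setup
    raise ValueError(f"unknown strategy: {strategy}")
```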

Results
Our submission. In our 'official' shared task submission in the evaluation phase, we used top-layer ELMo embeddings with the cosine similarity change detection algorithm for all languages. The English and German ELMo models were trained on the respective Wikipedia corpora. For Swedish and Latin, pre-trained ELMo models were not available, so we trained our own models on the union of the test corpora. This combination of architectures and algorithms was chosen based on our preliminary experiments with the available human-annotated semantic change datasets for English (Gulordava and Baroni, 2011), German (Schlechtweg et al., 2018) and Russian (Fomin et al., 2019). The resulting Spearman correlations were 0.136 for English, 0.695 for German, 0.370 for Latin, and 0.278 for Swedish. With an average Codalab score of 0.37, this submission ranked 9th out of 34 teams in the evaluation phase.
We were aware that the submitted setup was likely sub-optimal as it did not include the Fine-tuned model variant. After the official submission deadline (in the post-evaluation phase), we finished training and fine-tuning all of our language models. Their systematic evaluation is the main contribution of this paper.
Current results. The average scores of all the tested configurations across the 4 languages are given in Table 1. This table includes both the results of the configurations we used in the evaluation phase and the results of the configurations we tested after the submission deadline (fine-tuned models). We compare our scores to the organisers' baselines (FD and CNT+CI+CD, as provided by Schlechtweg et al. (2020)) and to the classical approach of calculating cosine distance between static CBOW word embeddings (Mikolov et al., 2013). The CBOW models were used in two different flavours: 1) 'incremental', where the C_2 model was initialised with the C_1 weights (Kim et al., 2014), and 2) 'Procrustes', where the two models were trained independently on C_1 and C_2 and then aligned using the orthogonal Procrustes transformation (Hamilton et al., 2016). Note that although we did not try different hyperparameter combinations for the static embeddings (varying vector dimensionalities, learning rates, vocabulary sizes, etc.), we did not do that for contextualised embeddings either. All training hyperparameters for both ELMo and BERT were fixed to their default values (see Section 4); we varied only the training corpora and the layers from which embeddings were extracted. Thus, we believe the static and contextualised embedding-based approaches were compared fairly. Table 1 shows that no method achieves statistically significant correlation on all 4 languages, which attests both to the difficulty of the task and to the diversity of the test sets. CBOW Procrustes is a surprisingly strong approach, consistently outperforming the organisers' baselines. Only PRT and APD obtain higher average scores, with fine-tuned ELMo models performing better than fine-tuned BERT.
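For reference, the orthogonal Procrustes alignment behind the CBOW baseline can be sketched as below (a textbook implementation under our assumptions, not the shared task code; rows of the two matrices are vectors for the shared vocabulary, and the rotation R minimises the Frobenius norm ||W2 R - W1||):

```python
import numpy as np

def procrustes_align(W1, W2):
    """Rotate the C2 embedding matrix W2 into the space of the C1
    matrix W1 via the orthogonal Procrustes solution (SVD of the
    cross-covariance matrix)."""
    u, _, vt = np.linalg.svd(W2.T @ W1)
    rotation = u @ vt        # optimal orthogonal transformation
    return W2 @ rotation

def cosine_change(v1, v2):
    """Cosine distance between a word's aligned vectors: the change score."""
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```

After alignment, each target word's change score is simply the cosine distance between its row in W1 and its row in the aligned W2.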
Judging only from the average correlation scores, contextualised embeddings do not seem to outshine their static counterparts, especially considering that both ELMo and BERT are more computationally demanding than CBOW. However, a closer analysis of the per-language results shows that the contextualised approaches in fact outperform the CBOW Procrustes baseline by a large margin on each of the shared task test sets. Table 2 features the scores obtained by our best-performing methods (PRT and APD with top-layer embeddings from fine-tuned ELMo and BERT) on the individual languages of the shared task. We also report performance on the GEMS ('GEometrical Models of Natural Language Semantics' workshop) test set (Gulordava and Baroni, 2011) to enable a comparison with previous work (Giulianelli et al., 2020). The discrepancy between the averaged and the per-language results can be explained by properties of the test sets: APD works best on the English and Swedish sets, while PRT yields the best scores for German and Latin.
Although consistency across languages (3 out of 4) is an important benefit of the CBOW Procrustes approach, with the right choice of APD or PRT, contextualised embeddings can improve Spearman correlation coefficients by up to 50%. This is not a language-specific property: the English GEMS test set does not behave like the English test set from the shared task. In fact, one can observe 3 groups of test sets with regard to their statistical properties and the method they favour: group 1 (Latin and German) exhibits rather uniform gold score distributions and prefers PRT; group 2 (English and Swedish) is characterised by more skewed gold score distributions and prefers APD; group 3 (GEMS) is in between, with no clear preference. Interestingly, the method which produces a more uniform predicted score distribution (APD) works better for the test sets with skewed gold distributions, and the method which produces a more skewed predicted score distribution (PRT) works better for the uniformly distributed test sets (as can be seen in the Appendix). Furthermore, there is a perfect negative correlation (ρ = −1) between the median gold score of a test set and the performance of the APD algorithm with fine-tuned ELMo models on this test set. The same correlation for the PRT performance is not significant but strictly negative. We currently do not have a plausible explanation for this behaviour.

Table 1: Spearman correlation coefficients for Subtask 2 averaged over four languages. The number of asterisks denotes the number of languages for which the correlation was statistically significant (p < 0.05).

Table 2: Spearman correlation per test set for our best methods (post-evaluation phase). † marks statistical significance (p < 0.05).

Table 2 also supports the previous observation that ELMo models perform better than BERT in the LSCD task. The only test set for which this is not the case is Latin, while on GEMS, ELMo and BERT are on par. One possible explanation is that our ELMo models were pre-trained on lemmatised Wikipedia corpora and thus better fit the test corpora, provided in lemmatised form by the organisers. The BERT models were pre-trained on raw corpora, and fine-tuning them on lemmatised data proves less successful. This is of course not an advantage of the ELMo architecture per se; however, easy and fast training from scratch on the respective Wikipedia corpora for each shared task language was possible only because of the much lower computational requirements of ELMo compared to BERT. In the post-evaluation phase of the shared task, we submitted predictions obtained with the optimal system configurations: fine-tuned ELMo + APD for English and Swedish, fine-tuned ELMo + PRT for German, and fine-tuned BERT + PRT for Latin. This submission reached an average Spearman correlation of 0.618 and, at the time of writing, it is the best Subtask 2 submission for SemEval-2020 Task 1 (among those publicly available on the shared task website). Of course, this optimal choice of configurations was possible only because we already knew the test data. Still, it is useful for understanding the real abilities of contextualised embedding-based approaches and the peculiarities of different models and test sets.

Conclusion
Our experiments for the SemEval-2020 Shared Task 1 (Subtask 2) show that using contextualised embeddings to rank words by the degree of their semantic change yields strong correlation with human judgements, outperforming approaches based on static embeddings. Models pre-trained on large external corpora and fine-tuned on the historical test corpora produce the highest correlation results, with ELMo slightly but consistently outperforming BERT as a contextualiser.
The inverted cosine similarity between averaged contextualised embeddings and the average pairwise cosine distance between contextualised embeddings turned out to be the best semantic change detection algorithms. An interesting finding is that the former performs best on test sets with a uniform gold score distribution, while the latter works best on test sets where the gold score distribution is skewed towards low values. This distinction is not related to the language of the test set or to any other linguistic or statistical property of the test sets we looked at. While slightly different human annotation protocols may be at play here, we believe this dependency between the gold score distribution and the performance of semantic change detection systems deserves to be investigated further in future work.
We also intend to explore ways to improve our best-performing methods (PRT and APD), especially with regard to removing outlier embeddings before calculating the semantic change score.
The grouping differences can be quantified with respect to the median gold score (after unit-normalisation). Figure 2 shows the dependency of PRT and APD performance on the median score of the gold test set. The dots are the performance values of the PRT or APD algorithms on the different test sets. The English and Swedish test sets are in the left part of the plot, with median gold scores of 0.200 and 0.203 respectively. German, GEMS and Latin are on the right, with 0.266, 0.267 and 0.364 respectively. There is a perfect negative Spearman correlation between the median gold scores of these 5 test sets and the performance of the APD semantic change detection algorithm on each of them (with fine-tuned ELMo embeddings).
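This check can be reproduced with SciPy; the median gold scores below are from the paper, but the APD performance values are hypothetical stand-ins chosen only to illustrate how a perfect negative rank correlation arises (the real per-set scores are in Table 2):

```python
from scipy.stats import spearmanr

# Median unit-normalised gold scores per test set (from the paper):
# English, Swedish, German, GEMS, Latin.
medians = [0.200, 0.203, 0.266, 0.267, 0.364]

# HYPOTHETICAL APD Spearman scores, one per test set, strictly
# decreasing to illustrate the reported rho = -1.
apd_scores = [0.605, 0.569, 0.560, 0.510, 0.304]

rho, p_value = spearmanr(medians, apd_scores)
print(rho)  # -1.0: APD performs worse the higher the median gold score
```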