DiaSense at SemEval-2020 Task 1: Modeling Sense Change via Pre-trained BERT Embeddings

This paper describes DiaSense, a system developed for Task 1 ‘Unsupervised Lexical Semantic Change Detection’ of SemEval 2020. In DiaSense, contextualized word embeddings are used to model word sense changes. This allows for the calculation of metrics which mimic human intuitions about the semantic relatedness between individual use pairs of a target word for the assessment of lexical semantic change. DiaSense is able to detect lexical semantic change in English, German, Latin and Swedish (accuracy = 0.728). Moreover, DiaSense differentiates between weak and strong change.


Introduction
Task 1 of SemEval-2020 is concerned with the unsupervised detection of lexical semantic change (LSC) as reflected by word sense changes over time. More broadly, LSC refers to changes in the meaning of a lexical item. A meaning change is manifested in the gain or loss of a particular meaning of a word, which indicates an increase or decrease in polysemy (Traugott and Dasher, 2001; Traugott, 2006). A well-known example of LSC, cited in Tahmasebi et al. (2018), is the historical evolution of the English word hound, which changed from being the general word for 'dog' to referring only to a specific kind of 'dog' ('narrowing', cf. Traugott (2006)). Meanwhile, dog changed from describing a specific type of 'dog' to becoming the general term for 'dog' ('broadening', cf. Traugott (2006)). Moreover, senses can become obsolete and be lost altogether, while cultural changes may drive the evolution of new senses. The main task in SemEval-2020 Task 1 is to identify and evaluate LSC in a set of target words between two text corpora stemming from two different time periods t1 (earlier period) and t2 (later period). The investigated languages are German, English, Swedish and Latin. The task is split into two subtasks: (i) a binary classification task, where it has to be determined whether or not a target word lost or gained senses between t1 and t2, and (ii) a ranking task, where target words are ranked according to their degree of LSC (a higher rank indicates stronger change).
The system presented in this paper is called DiaSense and addresses the LSC tasks by modeling the different senses of a word via contextualized word embeddings using pre-trained BERT (Devlin et al., 2018). The binary classification and the ranking task are approached by transferring the measures of change suggested by Schlechtweg et al. (2018), which are originally based on human-annotated values for meaning relatedness, to change metrics calculated on the basis of differences between target word embeddings. DiaSense is able to detect LSC in the majority of cases in all four languages (average accuracy 0.728). Although the results produced for the ranking task only show a weak correlation with the gold data (Spearman's ρ = 0.337), DiaSense is able to distinguish between strong and weak change.

Related Work
Over recent years, research on LSC has seen an increasing use of computational methods (see Tahmasebi et al. (2018) and Kutuzov et al. (2018) for detailed overviews). The methods applied for LSC detection are manifold, but can be grouped into three classes according to the type of meaning representation involved: (i) semantic vector spaces, (ii) topic distributions, (iii) sense clusters.
In semantic vector space approaches, each target word is represented as a vector at each time stage. The vector representations are typically based on bag-of-words approaches and represent co-occurrence statistics of a word with its context words. Common methods employed for computing vectors from co-occurrence statistics are Positive Pointwise Mutual Information (PPMI), which measures co-occurrence strength, and Singular Value Decomposition (SVD), for dimensionality reduction; see, e.g., Levy et al. (2015), Hamilton et al. (2016), Hellrich and Hahn (2017), Kahmann et al. (2017). Moreover, word embeddings as generated via the Skip-Gram with Negative Sampling (SGNS) technique (Mikolov et al., 2013), i.e., word2vec, and GloVe embeddings (Pennington et al., 2014) have been applied to LSC detection, e.g., Hamilton et al. (2016) and Hellrich and Hahn (2016). As a measure of LSC, similarity across time periods is assessed by calculating the distance or similarity between vectors, using, e.g., cosine distance (Salton and McGill, 1983), or alternatively by computing differences in the contextual dispersion of the vectors. In approaches where meaning is represented via topic distributions (Bamman and Crane, 2011; Lau et al., 2012; Cook et al., 2014), word senses are derived from topic models based on, e.g., Latent Dirichlet Allocation (LDA; Blei et al. (2003)) and Hierarchical Dirichlet Processes (HDP; Teh et al. (2006)). Furthermore, the dynamic topic model SCAN was specifically developed for the investigation of lexical change (Frermann and Lapata, 2016). With topic modeling, LSC is usually assessed via a frequency-based novelty score assigned to the senses. Approaches based on sense clustering follow similar principles, but are used less often; see, e.g., Mitra et al. (2015).
Recently, efforts have been made towards developing evaluation standards and datasets for LSC (Hamilton et al., 2016; Frermann and Lapata, 2016; Schlechtweg et al., 2018; Dubossarsky et al., 2019). For example, Schlechtweg and im Walde (2020) generate simulations of LSC on the basis of synchronic data, providing a testbed for diachronic LSC, while Schlechtweg et al. (2018) provide human annotations via the Diachronic Usage Relatedness (DURel) dataset. A current weakness of existing work on LSC is the lack of a common state-of-the-art evaluation task, which makes the comparison of methods difficult. This shortcoming is addressed by SemEval-2020 Task 1.

System description
DiaSense measures change by combining word sense representations generated via BERT with LSC metrics which are based on the calculation of cosine distance as detailed in the following.

Word sense representations
DiaSense makes use of a semantic vector space approach to represent the lexical semantic content of the target words. In contrast to previous approaches, which employ static word embeddings as generated by, e.g., SGNS and GloVe, DiaSense is based on the state-of-the-art contextualized word embeddings provided by BERT. With static word embeddings, each word is represented via a single vector for each time period, which is shared by all senses of a polysemous word. Although some contextual information is captured, it is difficult to differentiate between the senses involved. This problem is alleviated by contextualized vector representations, where each vector is a function of a whole input sentence, keeping different word senses apart.
A further advantage of using BERT is that we can leverage the pre-trained models released by Google AI (Devlin et al., 2018), which spares us the task of training models ourselves. Pre-trained contextualized embeddings have proven to be almost as effective as corresponding state-of-the-art models in linear NLP probing tasks such as part-of-speech tagging (Liu et al., 2019). Wiedemann et al. (2019) furthermore showed that pre-trained BERT allows for the disambiguation of polysemous words. Being able to use a pre-trained model is beneficial when working with historical data, which is sparse by nature, with less training data available for the more distant past than for more recent time stages. Pre-trained static embeddings exist (e.g., fastText), but are less applicable to historical data since they usually do not provide for out-of-vocabulary (OOV) words. Since the vocabulary in historical documents might differ substantially from the modern vocabulary used for the generation of word embeddings, historical documents are likely to contain a large number of OOV words. The token-based approach employed by BERT, on the other hand, is designed to include OOV words, handling them via subword embeddings. Moreover, the pre-trained multilingual BERT embeddings allow for a language-independent approach to LSC, without having to scale to new languages by calculating new embedding matrices and parameters.
Our system is based on bert-as-service (Xiao, 2018), a Python library which uses Google's BERT model as a sentence encoder, hosting it as a service via ZeroMQ. Bert-as-service is easy to set up and allows for the mapping of sentences into fixed-length BERT embeddings with just two simple lines of code. In DiaSense, LSC is assessed separately for each language, but we feed the same model to bert-as-service, i.e., the cased multilingual 12-layer BERT-base model. By default, bert-as-service works on the second-to-last layer. Bert-as-service also makes it possible to obtain contextualized ('ELMo-like', cf. Peters et al. (2018)) word representations from the sentence embeddings. In DiaSense, we compute a sentence embedding via bert-as-service for each sentence in which a target word occurs and take the corresponding target word embeddings to be representations of the target word's senses. This is done separately for t1 and t2. If a target word has been tokenized into several subword units by BERT, we average over all subword embeddings which belong to the target word, taken from the corresponding sentence embedding.
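The subword averaging step can be illustrated with a minimal sketch. It assumes that token-level vectors and the corresponding WordPiece tokens for a sentence have already been obtained from the encoder; the function name `target_embedding` and its arguments are hypothetical and not part of bert-as-service, and plain Python lists stand in for the array types a real implementation would use.

```python
def target_embedding(tokens, vectors, target_pieces):
    """Average the vectors of the WordPiece tokens that spell the target word.

    tokens        -- WordPiece tokens of one sentence, e.g. ["the", "ear", "##wig"]
    vectors       -- one embedding (list of floats) per token, in the same order
    target_pieces -- WordPiece tokens of the target word, e.g. ["ear", "##wig"]

    Returns the averaged vector of the first occurrence, or None if the
    target word does not occur in the sentence.
    """
    n = len(target_pieces)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target_pieces:
            span = vectors[start:start + n]
            dim = len(span[0])
            # Component-wise mean over all subword vectors of the target.
            return [sum(v[d] for v in span) / n for d in range(dim)]
    return None
```

A target word that BERT keeps as a single token is covered by the same function, since the average over one vector is that vector itself.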

LSC metrics
DiaSense was altered substantially in the post-evaluation phase, i.e., after the publication of the ground truth, with respect to the metrics employed for assessing LSC. We report on the metrics used in the evaluation and the post-evaluation phase in the following, but focus on the post-evaluation metrics, since these immensely improved the system's overall performance (see Section 5).
Evaluation phase. In the evaluation phase, the binary classification task was approached via clustering. We clustered the target word embeddings generated via BERT using the KMeans algorithm as implemented in scikit-learn (Pedregosa et al., 2011), generating 'sense clusters' (with k = 2 as default). Change was then measured on the basis of a frequency threshold. That is, a word was classified as changing when at least 90% of the embeddings in a cluster came from one corpus only. The ranking task was addressed by calculating an average embedding for each target word in each corpus. Then, we measured the degree of change by computing the cosine distance between the average embedding from t1 and the average embedding from t2 of each target word. Cosine distance (cosine) between two (non-zero) vectors x and y is defined on the basis of their cosine similarity (sim), which corresponds to the dot product of the vectors divided by the product of their Euclidean lengths (Manning et al., 2008):
\[ \mathrm{cosine}(x, y) = 1 - \mathrm{sim}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \]
A cosine distance value close to 0 indicates a low difference and a value close to 1 a high difference. Thus, we interpreted a large distance as a high degree of change in the ranking task in the evaluation phase.
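The two evaluation-phase measures can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the cluster labels are assumed to come from a separate clustering step (e.g., KMeans with k = 2), the corpus labels "t1" and "t2" and all function names are hypothetical, and embeddings are plain lists of floats.

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def mean_vector(vectors):
    """Component-wise average of a list of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def average_embedding_change(emb_t1, emb_t2):
    """Ranking score: distance between the average embeddings of t1 and t2."""
    return cosine_distance(mean_vector(emb_t1), mean_vector(emb_t2))


def cluster_signals_change(cluster_labels, corpus_labels, threshold=0.9):
    """True if some cluster draws at least `threshold` of its members
    from a single corpus (corpus labels are "t1" or "t2")."""
    for cluster in set(cluster_labels):
        members = [c for k, c in zip(cluster_labels, corpus_labels) if k == cluster]
        share_t1 = members.count("t1") / len(members)
        if max(share_t1, 1.0 - share_t1) >= threshold:
            return True
    return False
```

Averaging before measuring distance collapses all uses of a word in a period into one point, which is the property discussed as a possible source of error in the results.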
Post-evaluation phase. To detect and measure LSC in the post-evaluation phase, DiaSense calculates several different metrics on the basis of the target word embeddings. The metrics are based on the measures provided by Schlechtweg et al. (2018) for the assessment of LSC with respect to DURel annotations: ∆LATER and COMPARE. DURel contains gold standard annotations for 22 target words with respect to diachronic LSC in German. The annotations rest upon meaning relatedness scores assigned to sentence pairs in which a specific word occurs ('use pairs'), ranging from 1 (unrelated) to 4 (identical). The scores are inspired by Blank's (1997) semantic proximity continuum (proximity increases): homonymy > polysemy > context variance > identity. Thus, a high mean relatedness value between use pairs indicates meaning identity or context variance and a low value indicates polysemy or homonymy. According to this rationale, in a scenario of innovative meaning change from t1 to t2 (emergence of a new meaning), the meaning relatedness in t2 should be lower than in t1, and vice versa when reductive meaning change (loss of a meaning) takes place. ∆LATER of a word w captures these intuitions and measures changes in the degree of mean relatedness by subtracting w's mean value in t1 (earlier) from the mean value in t2 (later): ∆LATER(w) = Mean_l(w) − Mean_e(w). A high positive ∆LATER value shows an increasing relatedness over time and can be interpreted as reductive meaning change. A low negative ∆LATER in turn indicates innovative meaning change. In contrast to ∆LATER, COMPARE directly measures the relatedness of a word between t1 and t2, via the mean value of relatedness scores assigned to use pairs which consist of a sentence from t1 and a sentence from t2 (with COMPARE(w) = Mean_c(w)). COMPARE measures the degree of change (Schlechtweg et al., 2018), with a high value indicating weak and a low value indicating strong change.
Instead of assigning relatedness ranks to use pairs, DiaSense captures relatedness between target word embeddings via cosine distance. In doing so, a high cosine distance between target word embeddings can be interpreted as low meaning relatedness, while a low cosine distance value indicates a high meaning relatedness. To compute ∆LATER(w), we calculate the cosine distances between all embeddings of a target word in t2 and assign the mean value of these distances to Mean_l(w). We proceed similarly for t1 to compute Mean_e(w) and calculate ∆LATER(w) analogously to Schlechtweg et al. (2018) as the difference between Mean_l(w) and Mean_e(w). Because distance is the inverse of relatedness, the interpretation of the sign is reversed relative to the annotation-based measure: a high positive value for ∆LATER indicates innovative meaning change and a low negative value indicates reductive meaning change. COMPARE(w) is computed by calculating the mean of cosine distances between all use pairs where one embedding is from t1 and the other is from t2. Here, too, the values must be interpreted inversely on the basis of cosine distance: a high COMPARE value indicates strong change, while a low value is an indicator of weak change.
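The embedding-based versions of ∆LATER and COMPARE described above can be sketched as follows. The function names are hypothetical, embeddings are plain lists of floats, and each period is assumed to contain at least two embeddings of the target word so that within-period pairs exist.

```python
import math
from itertools import combinations, product


def cosine_distance(x, y):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def mean_within(embeddings):
    """Mean cosine distance over all unordered use pairs within one period."""
    dists = [cosine_distance(a, b) for a, b in combinations(embeddings, 2)]
    return sum(dists) / len(dists)


def delta_later(emb_t1, emb_t2):
    """Mean_l(w) - Mean_e(w): positive values mean uses are further apart in t2."""
    return mean_within(emb_t2) - mean_within(emb_t1)


def compare(emb_t1, emb_t2):
    """Mean cosine distance over all use pairs with one use from each period."""
    dists = [cosine_distance(a, b) for a, b in product(emb_t1, emb_t2)]
    return sum(dists) / len(dists)
```

In a toy case where all t1 uses point in one direction and the t2 uses split into two orthogonal directions, `delta_later` is positive, matching the reading of a positive value as innovative change.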
Additionally, Schlechtweg et al. (2018) suggest for future research to normalize COMPARE with respect to polysemy in order to be able to differentiate between context variation and real diachronic change. Therefore, they propose to subtract the mean relatedness value of the earlier time period, i.e., Mean_e(w), from COMPARE and calculate ∆COMPARE in this way. However, this only captures the variation in the earlier period, not accounting for the whole variation present in the two corpora. Thus, instead of subtracting the mean value of the earlier period, we propose to calculate ∆COMPARE by subtracting the mean cosine distance between all target word embeddings from C1 and C2, without differentiating between periods. In this way, we can capture the amount of within variation across both corpora.

Experimental setup
Data. For each language, two corpora C1 and C2 (for t1 and t2) and a set of target words were provided in the task. The corpora were pre-processed in that punctuation, empty sentences and one-word sentences were removed. Additionally, all sentences were lemmatized and randomly shuffled within each corpus. This is meant to mimic the challenging nature of historical linguistic data, where incomprehensible and incomplete data is the norm rather than the exception. For English, C1 and C2 consist of data from the CCOHA corpus (Alatrash et al., 2020), representing the time stages 1810-1860 (t1) and 1960-2010 (t2). For German, C1 contains texts from 1800 to 1899 taken from the DTA corpus (Deutsches Textarchiv, 2017), while C2 combines data from two newspaper corpora (Berliner Zeitung, Neues Deutschland), covering 1946 to 1990. For Latin, data was taken from the LatinISE corpus (McGillivray and Kilgarriff, 2013). C1 features data from the beginning of the second century to the end of the first century BCE, while C2 contains data from the beginning of the first century to the end of the twenty-first century CE. The Swedish corpora are based on data from KubHist (Adesam et al., 2019), with data from 1790-1830 for C1 and from 1895-1903 for C2. Overall, the application scenario is broad, with corpora covering four languages and spanning time stages which differ in terms of their chronological depth and length.
Parameters of change. For each language and each target word, we compute ∆LATER, COMPARE and ∆COMPARE. The BERT embeddings for each target word per time period are based on a maximum of 500 sentences. The binary classification task is addressed via ∆LATER since the metric measures changes in the mean relatedness of words over time (Schlechtweg et al., 2018). The target words are ranked according to ∆LATER and we take the top-ranked target words, i.e., those with the highest positive and lowest negative values, to undergo LSC; see, e.g., Figure 1 for German. The thresholds for the binary classification were experimentally defined on the basis of these ranks and vary across languages. That is, we plotted the ranked target words as shown in Figure 1 and defined the thresholds on the basis of the points in the plot where the distribution begins to become skewed to the left and right, respectively. The results for the ranking task are based on the absolute ∆COMPARE values, i.e., the normalized version of the COMPARE measure which differentiates between weak and strong change (Schlechtweg et al., 2018), thus measuring the degree of change. Moreover, we calculate the standard deviation (sd) of cosine distances in the earlier and in the later group to provide a measure of the context variation in each corpus.
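The pooled normalization behind ∆COMPARE and the threshold-based decision can be sketched as follows. This is one possible reading of the description, with the pooled term taken as the mean over all unordered pairs of embeddings from both corpora; the function names, the threshold arguments `lower` and `upper`, and the toy values in the usage note are assumptions for illustration only.

```python
import math
from itertools import combinations, product


def cosine_distance(x, y):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def mean_distance(pairs):
    """Mean cosine distance over an iterable of vector pairs."""
    dists = [cosine_distance(a, b) for a, b in pairs]
    return sum(dists) / len(dists)


def delta_compare(emb_t1, emb_t2):
    """COMPARE minus the mean distance over the pooled embeddings of both corpora."""
    cross = mean_distance(product(emb_t1, emb_t2))             # COMPARE
    pooled = mean_distance(combinations(emb_t1 + emb_t2, 2))   # period-blind variation
    return cross - pooled


def classify(delta_later_by_word, lower, upper):
    """Label a word 1 (change) if its Delta-LATER lies at or beyond either threshold."""
    return {w: int(v <= lower or v >= upper) for w, v in delta_later_by_word.items()}


def rank_by_change(delta_compare_by_word):
    """Order words from strongest to weakest change by absolute Delta-COMPARE."""
    return sorted(delta_compare_by_word,
                  key=lambda w: abs(delta_compare_by_word[w]),
                  reverse=True)
```

With toy values, `classify({"w1": -0.05, "w2": 0.001, "w3": 0.04}, lower=-0.03, upper=0.03)` marks w1 and w3 as changing and w2 as stable; in the setup described above, the two thresholds would be read off the ranked plot for each language.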
Evaluation. In SemEval-2020 Task 1, the system is evaluated with respect to its performance on the binary classification and the ranking task. The evaluation of the binary classification output is based on accuracy measured against the true binary classification as annotated by humans. The output of the system for the ranking task is evaluated using Spearman's rank correlation coefficient (ρ) by calculating the correlation between the produced values and the true ranks as annotated by humans. For a detailed description of the gold data, please see the task description paper.
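For reference, the two evaluation measures can be sketched in a few lines. This is a generic illustration, not the official scoring script: Spearman's ρ is computed as the Pearson correlation of average ranks (so ties are handled), and the function names are hypothetical.

```python
import math


def accuracy(predicted, gold):
    """Share of target words whose binary label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Because ρ depends only on ranks, the system's raw scores do not need to be on the same scale as the gold change scores.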

Results
DiaSense has been significantly improved after the publication of the ground truth for SemEval-2020 Task 1. Before this, the system showed a comparatively low performance in the evaluation phase, with an average accuracy of 0.554 in the classification task (rank 17 of 21) and ρ = 0.234 in the ranking task (rank 14 of 21); see the task description paper for the full rankings. We attribute the low performance in the binary classification to two factors. First, k = 2 might not have been the optimal parameter for KMeans clustering for all target words. We experimented with approaches to automatically determine k, but did not arrive at a suitable solution. Second, cluster initialization turned out to be difficult with the pre-trained BERT embeddings, since distances between the embeddings are generally low (cf. Reimers and Gurevych (2019) on clustering issues with pre-trained BERT embeddings). In addition, the frequency threshold employed for identifying change in the clustering was arbitrarily defined. The low performance in the ranking task could be the result of averaging the embeddings, where context variation might substantially bias the resulting vectors. Given these shortcomings, we decided to opt for alternative ways of measuring LSC, experimenting with ∆LATER and ∆COMPARE instead.
Currently (post-evaluation), the system ranks third in the binary classification with an overall average accuracy of 0.728 (English: 0.649, German: 0.771, Latin: 0.750, Swedish: 0.742). In the ranking task, DiaSense occupies rank 14 with ρ = 0.337 (English: 0.293, German: 0.414, Latin: 0.343, Swedish: 0.300). As an example, we discuss the results for German in the following. For German, the words with the highest positive ∆LATER (innovative meaning change) are abgebrüht 'boiled out/indifferent', abdecken 'cover/unroof/blanket', Tier 'animal', Festspiel 'festival', abbauen 'win (mining)/reduce', artikulieren 'articulate/enunciate', Titel 'title' and Rezeption 'reception' (see Figure 1, left). In the ground truth, abgebrüht, Tier, Festspiel, and Titel are not classified as changing. We can confirm Tier and Titel as false positives. Tier and Titel show high standard deviations in both t1 and t2 (with sd > 0.04), indicating that they exhibit a high context variation overall instead of undergoing a meaning change. However, standard deviation does not provide insights into whether Festspiel and abgebrüht are false positives. Instead, Festspiel shows a large difference between t1 and t2 based on a frequency effect (51 occurrences in C1, > 500 occurrences in C2). Yet, since data sparsity is an inherent problem of historical corpora, we cannot exclude Festspiel as easily. Moreover, abgebrüht indeed shows LSC on the basis of our data: in t1, abgebrüht is almost exclusively used as the participle of the verb abbrühen 'boil out', while it occurs mostly as an adjective with the more figurative meaning 'indifferent' in t2.
The target words überspannen 'span/overstretch/straddle', Fuß 'foot', Abgesang 'last verse (minnesong)/swansong', Schmiere 'grease/lookout', Knotenpunkt 'junction/intersection' and Ohrwurm 'earwig/catchy tune' have the lowest negative values (reductive meaning change); see Figure 1, right. Similarly to abgebrüht, Fuß is not classified as undergoing LSC in the gold data, but can in principle be identified as changing: while in t1 Fuß is still frequently used as a unit of measurement, this meaning occurs only rarely in t2. Overall, the system performs well when it comes to the identification of large changes. For example, Ohrwurm, which shows the highest (absolute) ∆LATER value in German, changed quite drastically from being mainly used in the meaning of 'earwig' in t1 to almost exclusively denoting a 'catchy tune' in t2. However, the system fails to identify smaller-scale changes such as, e.g., Manschette 'sleeve/cuff', where meanings are close and occur in similar contexts.
DiaSense performs less well in the ranking task than in the classification task. However, although the correct ranking could not be identified, DiaSense puts the target words into similar regions, i.e., words with high ∆COMPARE values (e.g., Ohrwurm, abgebrüht) generally rank high, indicating strong change, and vice versa. Moreover, without normalizing COMPARE, Titel and Tier ranked highest, an error which was avoided by using ∆COMPARE instead.

Conclusion
In this paper, we presented DiaSense, a system developed for SemEval-2020 Task 1. Based on contextualized word embeddings, DiaSense is able to identify change in English, German, Latin and Swedish, while also differentiating between weak and strong change. Our approach leverages the strength of pre-trained BERT embeddings for modeling word senses language-independently and avoids the need for large amounts of training data, which is beneficial for historical linguistic work. Moreover, we were able to translate metrics developed to capture human intuitions about meaning relatedness into automated measures of LSC. Still, DiaSense was not able to predict the correct ranking in terms of degrees of LSC. We will address this issue in future research, experimenting with further measures and techniques.