DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

Word meaning is notoriously difficult to capture, both synchronically and diachronically. In this paper, we describe the creation of the largest resource of graded, contextualized, diachronic word meaning annotation in four different languages, based on 100,000 human semantic proximity judgments. We describe in detail the multi-round incremental annotation process, the choice of a clustering algorithm to group usages into senses, and possible – diachronic and synchronic – uses for this dataset.


Introduction
The view on word meaning and senses in computational linguistics has moved from a discrete (Weaver, 1949/1955; Navigli, 2009) to a graded (McCarthy and Navigli, 2009; Erk et al., 2009, 2013; Schlechtweg et al., 2018) perspective. However, scalable annotation strategies for this graded view, yielding large-scale data for semantic evaluation, have not been implemented yet. We build on two pre-existing schemata for graded contextual word meaning annotation (Erk et al., 2013) and show how they can be applied efficiently to create large-scale data in a diachronic setup.
Both procedures populate a Word Usage Graph (WUG, McCarthy et al., 2016; Schlechtweg et al., 2020) for a target word with annotator judgments. Procedure (i) requires annotators to judge usage pairs on a semantic proximity scale, avoiding the a priori definition of word senses. This makes it preparation-lean and reduces experimenter influence. Procedure (ii) relies on a predefined list of senses and requires annotators to judge usage-sense pairs on the same proximity scale as in procedure (i). Both procedures avoid binary assignments of word senses to word usages, which have been shown to be inadequate in many cases (Kilgarriff, 1997; Hanks, 2000; Kilgarriff, 2007). The resulting graphs relate word usages to each other (either directly or indirectly) and thus allow for a posteriori hard- or soft-clustering, where clusters can be interpreted as senses (Schütze, 1998; McCarthy et al., 2016; Schlechtweg et al., 2020). This makes the collapsing of senses possible, while allowing for sense overlap where this seems adequate after observing the annotated data. While both procedures require more judgments than traditional discrete word sense annotation, we show how the sampling of word usages can be optimized to reduce the number of necessary judgments.
We apply the above-described annotation procedures in a multi-lingual diachronic setup to create Diachronic WUGs (DWUGs). These contain annotations of the usages of a set of target words in corpora from two time periods (Schlechtweg et al., 2020). This allows us to identify changes in the WUGs over time. The final resource contains 168 DWUGs for four different languages (English (EN), German (DE), Swedish (SV), Latin (LA)), relying on approximately 100,000 human judgments.
After describing the annotation procedure, we provide a detailed analysis of annotator disagreements and evaluate the robustness of the annotated graphs. DWUGs can be exploited in many ways:
• as large sets (thousands) of pairwise semantic proximity judgments to evaluate contextualized embeddings in multiple languages;
• the inferred change scores can be used to evaluate semantic change detection models;
• as word sense disambiguation/discrimination resources with additional aspects such as variation over time;
• the graphs may be treated as research objects in their own right, providing insights into cognitive aspects of word meaning and posing practical problems such as finding robust and efficient clustering algorithms.
Related Work

There has been a significant shift in the view on word meaning and word senses in computational linguistics since the birth of the field. The early formulations of the Word Sense Disambiguation (WSD) task took a discrete view on word senses, assuming a fixed inventory of senses and a single best sense per word usage (Weaver, 1949/1955; Navigli, 2009). After this view was shown empirically to be inadequate (Kilgarriff, 1997; Hanks, 2000; Kilgarriff, 2007), researchers have increasingly adopted a graded view on word senses, whereby a word usage may be assigned to multiple senses and more fine-grained distinctions are allowed within senses (McCarthy and Navigli, 2009; Erk et al., 2009, 2013). Moreover, various approaches to how senses can be characterized have been proposed, from manual sense descriptions (Wilks and Keenan, 1975) to representing a sense solely by clusters of word usages (Schütze, 1998) or by lexical substitutes (McCarthy and Navigli, 2009). Recently, developments in computational models of the meaning of individual word usages (Peters et al., 2018; Devlin et al., 2019) have inspired new research on graded word meaning (Armendariz et al., 2019).

Our resource is related to discrete word sense annotation resources such as SemCor or OntoNotes in providing groups of word usages with the same or similar senses. However, it differs from those resources in the way in which senses are obtained, i.e., inferred from the pairwise annotated data, and in the graded nature of usage-usage and usage-sense comparisons. In this, our resources are strongly related to Usim and WSsim-2 (Erk et al., 2013), but differ from these by the additional diachronic dimension, the size of the graphs, and the principled and robust approach to clustering.

Data
The data for annotation was sampled from two time-specific historical subcorpora for each language, as summarized in Table 1. For English, we used the Clean Corpus of Historical American English (CCOHA; Davies, 2012; Alatrash et al., 2020), which spans the 1810s-2000s.
For each language, half of the target words (≈ 20) were chosen as words for which a change between C1 and C2 is described in etymological or historical dictionaries (OED, 2009; Paul, 2002; Clackson, 2011; Svenska Akademien, 2009). The other half was determined by sampling, for each target word, a control counterpart with the same POS and a comparable frequency development between C1 and C2. (For details refer to Schlechtweg et al. (2020).)

Procedure (i): Usage-Usage Graphs

We first describe the procedure devised to annotate the EN, DE and SV data; the procedure for LA is described later (Procedure (ii)). A usage-usage graph (UUG) G = (U, E, W) is a weighted, undirected graph, where nodes u ∈ U represent word usages and weights w ∈ W represent the semantic proximity of a pair of usages (an edge) (u1, u2) ∈ E (McCarthy et al., 2016; Schlechtweg et al., 2020). In practice, semantic proximity can be measured by human annotator judgments on a scale of relatedness (Brown, 2008; Schlechtweg et al., 2018) or similarity (Erk et al., 2013). The annotation procedure starts from a non-annotated sample of word usages and aims to populate a UUG for each target word in several rounds of annotation with human judgments of semantic relatedness. Annotators were asked to judge the semantic relatedness of pairs of word usages using the scale in Table 2.
(1) and (2) show two example usages of the noun plane.
(1) Von Hassel replied that he had such faith in the plane that he had no hesitation about allowing his only son to become a Starfighter pilot.
(2) This point, where the rays pass through the perspective plane, is called the seat of their representation.
Figure 1 shows three UUGs resulting from our annotation.
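A UUG of this kind can be sketched as a small data structure. The following is a minimal illustration, not the released implementation; the class name and usage identifiers are our own, and we assume the 1-4 relatedness scale with median-aggregated edge weights described above.

```python
from statistics import median

class UsageGraph:
    """Minimal sketch of a usage-usage graph (UUG): nodes are word
    usages, each edge stores all annotator judgments on that usage
    pair, and the edge weight is the median of those judgments."""

    def __init__(self):
        self.judgments = {}  # frozenset({u1, u2}) -> list of judgments

    def add_judgment(self, u1, u2, score):
        # DURel-style scale: 1 (unrelated) .. 4 (identical); 0 = cannot decide
        self.judgments.setdefault(frozenset((u1, u2)), []).append(score)

    def weight(self, u1, u2):
        return median(self.judgments[frozenset((u1, u2))])

g = UsageGraph()
g.add_judgment("plane_1", "plane_2", 4)  # e.g., two aircraft usages
g.add_judgment("plane_1", "plane_2", 3)
g.add_judgment("plane_1", "plane_3", 1)  # aircraft vs. geometric plane
print(g.weight("plane_1", "plane_2"))  # median of [4, 3] -> 3.5
```

With the 2.5 threshold used in the figures, the first edge would be drawn black (related) and the second gray (unrelated).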

Annotators
We started out with four annotators per language. Following high annotation loads and dropouts, additional annotators were hired, resulting in 9/8/5 total annotators for EN/DE/SV, respectively. All annotators were native speakers and current or former university students. Two annotators for DE and one each for EN and SV had a background in historical linguistics.

Usage sampling
We refer to an occurrence of a word w in a sentence as a 'usage of w'. For each target word, 100 usages were randomly sampled from each of C1 and C2 (Table 1). Each usage contained the target word in its lemma form and comprised a minimum of ten tokens, yielding a total of 200 usages per target word. If a target word had fewer than 100 usages, the full sample was annotated. The usage samples were subsequently mixed into a joint set U per target word. The set of usages U was annotated by presenting usage pairs to annotators in randomized order; hence, the annotators did not know from which time period each usage stemmed.
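The sampling step above can be sketched as follows. This is a simplified illustration under our own assumptions (usages as whitespace-tokenized sentences, a fixed seed); the actual pipeline worked on lemmatized corpora.

```python
import random

def sample_usages(corpus1, corpus2, n=100, min_tokens=10, seed=0):
    """Sketch of the usage sampling: draw up to n usages per time
    period, keep only sentences with at least min_tokens tokens, and
    mix the two samples so that annotators cannot tell which time
    period a usage came from."""
    rng = random.Random(seed)
    pooled = []
    for corpus in (corpus1, corpus2):
        eligible = [s for s in corpus if len(s.split()) >= min_tokens]
        k = min(n, len(eligible))  # fewer than n usages: take them all
        pooled.extend(rng.sample(eligible, k))
    rng.shuffle(pooled)  # randomize order across periods
    return pooled
```

The final shuffle is what hides the time period from the annotators; the per-period cap of n keeps the two samples balanced.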

Edge sampling
Annotating the full usage graph is not feasible even for a small set of n usages, as this implies annotating n(n-1)/2 edges. Hence, the main challenge with this annotation approach was to annotate as few edges as possible while keeping the information needed to infer a meaningful clustering on the graph. This was achieved by annotating the data in several rounds. After each round, the UUG of a target word was updated with the new annotations and a new clustering was obtained. Based on this clustering, the edges for the next round were sampled through heuristics similar to Biemann (2013).
The annotation load was randomly distributed making sure that roughly half of the usage pairs were annotated by more than one annotator.
The first round aimed to obtain a small high-quality reference set of clusters. This was achieved through the sampling of 10% of the usages from U and 30% of the edges by a random walk through the sample graph (exploration), which guaranteed that all nodes were connected by some path. Hence, the first clustering was obtained on a small but richly connected subgraph, ensuring that not too many clusters were inferred, as this would lead to a strong increase in annotation instances in the subsequent rounds. In the second round, the reference clusters from the first round served as a comparison for those usages which had not been assigned to a multi-cluster yet (combination).
In all subsequent rounds, both a combination step and an exploration step were employed. The combination step compared each usage u1 that was not yet a member of a multi-cluster with a random usage u2 from each of the multi-clusters to which u1 had not yet been compared. The exploration step consisted of a random walk on 30% of the edges from the non-assignable usages, i.e., usages which had already been compared to each of the multi-clusters but were not assigned to any of them by the clustering algorithm. This procedure slowly populated the graph while minimizing the annotation of redundant information. We aimed to stop the procedure when each cluster had been compared to each other cluster. The sample sizes for the random walk were tuned and validated in a simulation study (Schlechtweg et al., 2020).
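The combination and exploration steps can be sketched as below. This is an illustrative simplification under our own assumptions (the exploration random walk is reduced to random pairs among unassigned usages); it is not the released sampling code.

```python
import random

def next_round_pairs(usages, clusters, walk_frac=0.3, rng=None):
    """Sketch of one annotation round: 'clusters' maps each
    already-clustered usage to a cluster id; multi-clusters are
    clusters with more than one member."""
    rng = rng or random.Random(0)
    members = {}
    for u, c in clusters.items():
        members.setdefault(c, []).append(u)
    multi = [us for us in members.values() if len(us) > 1]

    pairs = []
    unassigned = [u for u in usages if u not in clusters]
    # Combination step: compare each unassigned usage with one random
    # member of every multi-cluster.
    for u in unassigned:
        for us in multi:
            pairs.append((u, rng.choice(us)))
    # Exploration step (simplified): random pairs among usages that
    # could not be assigned to any multi-cluster.
    n_walk = int(walk_frac * len(unassigned))
    for _ in range(n_walk):
        if len(unassigned) >= 2:
            pairs.append(tuple(rng.sample(unassigned, 2)))
    return pairs
```

For example, with one multi-cluster {u1, u2}, a singleton {u3} and unassigned usages u4 and u5, the combination step yields exactly one new pair per unassigned usage.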
The above procedure was combined with further heuristics, added after round 1, to increase the quality of the annotation: (i) sampling a low number of randomly chosen edges, as well as edges between already confirmed multi-clusters, for further annotation to corroborate the inferred structure; (ii) detecting relevant disagreements between annotators, i.e., judgments with a difference of ≥ 2 on the scale or edges with a median ≈ 2.5, and redistributing the corresponding edges to another randomly chosen annotator from among those who had not yet annotated the respective edge, in order to resolve the disagreements; and (iii) detecting clustering conflicts, i.e., positive edges between clusters and negative edges within clusters (see below), and sampling a new edge for each node connected by a conflicting edge. This added more information in regions of the graph where finding a good clustering was hard. Furthermore, after each round, nodes whose 0-judgments ('Cannot decide') made up more than half of their total judgments were removed from the graph; in a few cases, whole words were removed if they had a high number of '0' judgments or required a high number of further edges to be annotated. The annotation was stopped after four rounds due to time constraints. (An example of our annotation pipeline can be found in Appendix A.)
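Heuristic (ii) can be sketched as a simple predicate over an edge's judgments. This is a minimal illustration; the tolerance around the 2.5 boundary (eps) is our own assumption, and the function name is ours.

```python
from statistics import median

def needs_reannotation(judgments, scale_gap=2, boundary=2.5, eps=0.5):
    """Sketch of heuristic (ii): flag an edge whose judgments differ
    by >= scale_gap points on the DURel scale, or whose median lies
    near the cluster boundary of 2.5 (eps is an assumed tolerance)."""
    scores = [j for j in judgments if j != 0]  # 0 = 'Cannot decide'
    if not scores:
        return False
    big_gap = len(scores) > 1 and max(scores) - min(scores) >= scale_gap
    near_boundary = abs(median(scores) - boundary) < eps
    return big_gap or near_boundary

print(needs_reannotation([1, 3]))  # gap of 2 on the scale -> True
print(needs_reannotation([4, 4]))  # full agreement -> False
```

Flagged edges would then be routed to an annotator who has not yet judged them.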

Clustering
Tasks such as SemEval-2020 Task 1 require deriving a hard-clustering from the graphs. The UUGs obtained from the annotation were weighted, undirected, sparsely observed and noisy. This called for a robust clustering algorithm. For this, a variation of correlation clustering (Bansal et al., 2004; Schlechtweg et al., 2020) was employed, minimizing the sum of cluster disagreements, i.e., the sum of (absolute) negative edge weights within clusters plus the sum of positive edge weights across clusters. With edge weights shifted to W'(e) = W(e) - 2.5, the loss of a clustering C is

L(C) = Σ_{e within clusters: W'(e) < 0} |W'(e)| + Σ_{e between clusters: W'(e) > 0} W'(e).

That is, the sum of positive edge weights between clusters and (absolute) negative edge weights within clusters is minimized. Minimizing L is a discrete optimization problem which is NP-hard (Bansal et al., 2004); this is eased by the relatively low number of nodes (≤ 200). Hence, the global optimum can be approximated sufficiently well with a standard optimization algorithm such as Simulated Annealing (Pincus, 1970), which showed superior performance in a previous simulation study by Schlechtweg et al. (2020). Since we do not have strong efficiency constraints, we follow the same procedure. In order to reduce the search space, we iterate over different values for the maximum number of clusters. We also iterate over randomly as well as heuristically chosen initial clustering states.

Procedure (ii): Usage-Sense Graphs

In this section, we describe the procedure devised to annotate the Latin data. This procedure differs from that of the other languages because, in a trial annotation task, the annotators reported difficulties judging usage-usage pairs. In consideration of this, usage-sense graphs were employed. Since we do not have access to native speakers of Latin, eight annotators with high-level knowledge of Latin were recruited, ranging from undergraduate students to PhD students, post-doctoral researchers and more senior researchers.

Usage-sense graphs
A usage-sense graph (USG) G = (V, E, W) is a weighted, undirected graph whose nodes v ∈ V represent either word usages or sense descriptions and whose weights w ∈ W represent the semantic proximity of a usage-sense pair (u1, s1) ∈ E. We denote the set of word usages as U and the set of word sense descriptions as S, where V = U ∪ S.
Following Erk et al. (2013), semantic proximity can be measured by human annotator judgments on a scale similar to that used for the UUGs. Hence, we started from a non-annotated sample of usage-sense pairs and populated a USG for each target word with human judgments of semantic relatedness. Annotators were asked to judge the semantic relatedness of usage-sense pairs using the same scale as for the other languages. (4) contains an example of a usage-sense pair for sacramentum, displaying the older sense "a civil suit or process".

(4) Usage: Cum Arretinae mulieris libertatem defenderem et Cotta xviris religionem iniecisset non posse nostrum sacramentum iustum iudicari, [...] 'When I was defending the liberty of a woman of Arretium, and when Cotta had suggested a scruple to the decemvirs that our action was not a regular one, [...]'

Figure 3 shows three USGs resulting from our annotation. The first word, pontifex, originally meant "a member of the college of priests having supreme control in matters of public religion in Rome"; with Christianity, it acquired the sense of "bishop".
The three senses presented to the annotators were "priest, high priest", "Roman high-priest, a pontiff, pontifex", and "bishop". The first two correspond to the two red nodes in the bottom left corner of the first plot in Figure 3, and the last one corresponds to the top right red node. The plot of the second word, potestas, shows the complex and highly related set of its senses, which can be summarised as: "Power of doing any thing"; "Political power"; "Magisterial power"; "Meaning of a word" (the isolated sense on the far right of the plot); "Force, efficacy"; and "Angelic powers". The last plot refers to sacramentum and shows how the two senses "military oath of allegiance" and "oath" are close together at the top left of the plot, while the legal sense "a civil suit or process" is separated from the others in the top right corner and the Christian sense "sacrament" is at the bottom right corner.

Usage and sense sampling
For each target word, 30 usages containing ≥ 2 tokens were randomly sampled from each of C1 and C2, yielding a total of 60 usages per target word. The sense definitions were taken from the Latin portion of the Logeion online dictionary. Due to the challenge of finding qualified annotators, each word was assigned to one annotator, apart from virtus, which was annotated by four annotators and used for inter-annotator agreement (Table 3). The annotators could add comments to their annotations. The senses and usages were presented to the annotators in randomized order.

Edge sampling
Procedure (ii) has an upper bound on the total number of annotated usage-sense pairs of n × k for n usages and k senses. The number of senses ranged between 2 and 7, which, with a usage sample size of 60, yielded a manageable number of annotation instances (at most 60 × 7 = 420 per word). Hence, no further optimization of the edge sampling procedure was carried out. Note, though, that an optimization similar to that for procedure (i) would be possible by annotating the data incrementally or by randomly subsampling edges.

Clustering
From the annotation, USGs were obtained in which each usage is connected to each sense by one edge (see Figure 3). Therefore, there is a first-order path between each usage-sense pair and a second-order path between each usage-usage pair. Similarly to UUGs, we wanted to assign usages and senses to the same cluster if they received high judgments (3, 4) and to different clusters if they received low judgments (1, 2). We used the same clustering algorithm as for UUGs, defined in the Clustering subsection of procedure (i) above. In this way, usages end up in the same cluster if they have high judgments with the same senses. If there are contradictory judgments (e.g., a usage has high judgments with several senses), the clustering uses the global information to decide on the cluster assignment by choosing the option with the lowest loss. This can also lead to the collapsing of two sense descriptions into one cluster, as for Latin sacramentum in Figure 4.

Resource
A summary of the annotation outcome for each language can be found in Table 3.
Table 3: Overview of target words. LGS = language, n = no. of target words, N/V/A = no. of nouns/verbs/adjectives, |U| = avg. no. of usages per word, AN = no. of annotators, JUD = total no. of judged usage pairs, AV = avg. no. of judgments per usage pair, SPR = weighted mean of pairwise Spearman correlation in round 1, KRI = Krippendorff's alpha in round 1, LOSS = avg. normalized clustering loss × 10.

The final resource contains 40 target words for EN/SV/LA and 48 for DE. We report two annotation agreement measures: the mean pairwise Spearman correlation (Bolboaca and Jäntschi, 2006) between annotator judgments and Krippendorff's alpha (Krippendorff, 2004) for the judgments' consensus, both reaching scores comparable to previous studies (Erk et al., 2013; Schlechtweg et al., 2018; Rodina and Kutuzov, 2020). The clustering loss is the value of L (Definition 3) divided by the maximum possible loss on the respective graph; it measures how well the graphs could be partitioned into clusters by the L criterion. In total, roughly 100,000 judgments were made by the annotators. For EN/DE/SV, ≈50% of the usage pairs were annotated by more than one annotator, while for LA each target word but one was annotated by a single annotator.
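The pairwise Spearman correlation between two annotators can be sketched in a few lines. This is a minimal pure-Python illustration with average ranks for ties; in practice a library implementation would be used.

```python
def spearman(x, y):
    """Minimal sketch of Spearman's rank correlation between two
    annotators' judgments on the same set of edges: rank both series
    (ties get average ranks), then take the Pearson correlation of
    the ranks. Assumes non-constant input."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1                     # extend run of tied values
            avg = (i + j) / 2 + 1          # average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

print(round(spearman([1, 2, 3, 4], [1, 2, 3, 4]), 2))  # -> 1.0
```

For the per-language scores in Table 3, such pairwise correlations are computed for every annotator pair and combined as a weighted mean.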
Figure 5 shows the frequencies of annotator judgments on the DURel scale by language. On the UUGs (EN/DE/SV), judgment '4' is most frequent, followed either by judgment '2' (EN/DE) or '1' (SV). Least frequent are judgments of '0' ('Cannot decide'). Swedish has a considerably higher number of '0' judgments, presumably because of frequent OCR errors. On the USGs (LA), judgments of '1' are clearly most frequent, followed by '4'. This is because each usage is judged against each sense description, which can often be unrelated. Annotators make frequent use of the intermediate levels of the scale ('2', '3') and thus assign graded distinctions of word meaning.

Annotator disagreements
Roughly half of all edges were annotated by only one annotator. In order to estimate the reliability of these annotations, we report disagreement frequencies on all edges with two judgments, as displayed in Figure 6. Annotator pairs agree on 61-69% of these edges across languages, while they disagree by one point on the scale on 27-34%. Stronger disagreements are very rare, at less than 5%.
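The analysis behind these percentages can be sketched as follows; the function name and toy data are our own.

```python
from collections import Counter

def disagreement_profile(double_judged):
    """Sketch of the Figure 6 analysis: for edges with exactly two
    judgments, count how far apart the two annotators are on the
    scale, and report relative frequencies per gap size."""
    gaps = Counter(abs(a - b) for a, b in double_judged)
    n = len(double_judged)
    return {gap: count / n for gap, count in sorted(gaps.items())}

pairs = [(4, 4), (3, 4), (2, 4), (1, 1)]
print(disagreement_profile(pairs))  # -> {0: 0.5, 1: 0.25, 2: 0.25}
```

Gap 0 corresponds to full agreement, gap 1 to one-point disagreements, and gaps ≥ 2 to the rare strong disagreements discussed above.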
We further analyze annotator disagreements on a subset of words from the DWUG DE data set covering different POS (abbauen (VB), abgebrüht (ADJ), Knotenpunkt (NN), Manschette (NN), zersetzen (VB)); we extract edges where at least one annotator pair diverges by at least two points on the DURel scale in Table 2 (e.g., 1/3). We identify five sources of disagreement:
• ambiguity
• meaning unfamiliarity
• misleading context
• unclear meaning abstraction level
• different intuitions on semantic proximity
Most cases of disagreement between annotators can be traced back to ambiguity or unfamiliarity with the meaning of one of the usages.

(5) das war ein finsterer Herr mit dem harten Blick eines abgebrühten Schellfisches. 'that was a sinister gentleman with the hard look of a blanched/hard-nosed haddock'

(5) is a case of ambiguity: abgebrüht modifies an animal which could be "blanched" in the literal sense, but could also mean "hard-nosed", as the animal is further attributed with a "hard glance". Often, ambiguity is also triggered by missing sentence context. (6) is a short sentence which gives few clues about the meaning of the target word: Manschetten is ambiguous between at least a "fear", a "cuff" and a "collar" reading. In (7), abgebaut occurs in an archaic sense which was observed only once in our data and is likely unfamiliar to annotators. The context and the word's other senses suggest a meaning like "to destroy, to deprive", but the exact meaning is unclear. Further cases include usages with misleading context, where a superficial reading or certain key words suggest one reading while a deeper reading suggests another, and usages where the meaning of the target word could be described at various abstraction levels. There are also a few cases where the above categories do not apply, which may be due to (genuinely) different intuitions on semantic proximity.

Robustness
To estimate whether the clustering method is sensitive to spurious errors in the annotation procedure, we tested the robustness of our results to perturbations of the graphs' weights. We replaced existing annotations with random scores (i.e., changing scores only for existing annotation pairs), created new graphs, and clustered them. We then compared the clusters of the original graphs, which we viewed as true labels, to those of the manipulated graphs using cluster accuracy. This analysis, computed on the English graphs (Figure 7), demonstrates that the cluster structure of the graphs is robust under a relatively high degree of random annotations: at an error rate of 25% of the annotations, the manipulated graphs have a cluster accuracy greater than 80% on average.
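The perturbation step of this robustness check can be sketched as below. This is an illustrative simplification (the function name and toy graph are ours); re-clustering and the cluster-accuracy comparison are omitted.

```python
import random

def perturb_weights(judgments, error_rate, rng=None):
    """Sketch of the robustness check: replace a proportion
    error_rate of the existing annotations (and only existing ones;
    no new edges are added) with random scores from the 1..4 scale.
    The perturbed graph is then re-clustered and compared against the
    clusters of the unperturbed graph."""
    rng = rng or random.Random(0)
    perturbed = dict(judgments)
    edges = list(perturbed)
    k = int(error_rate * len(edges))  # number of annotations to corrupt
    for e in rng.sample(edges, k):
        perturbed[e] = rng.choice([1, 2, 3, 4])
    return perturbed

orig = {("a", "b"): 4, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 4}
noisy = perturb_weights(orig, 0.25)  # corrupt 25% of 4 edges = 1 edge
print(sum(orig[e] != noisy[e] for e in orig) <= 1)  # -> True
```

Sweeping error_rate and averaging cluster accuracy over repeated draws yields a curve like the one in Figure 7.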

Conclusion
We described the creation of the largest existing resource of word usage graphs capturing graded, contextualized word meaning for four languages: English, German, Swedish and Latin. We detailed the annotation procedure, including the sampling strategy aimed at reducing annotation effort while keeping a high density in regions where annotators have difficulty judging relatedness. The usage graphs have been clustered, and we openly release the clusterings, visualizations and an analysis of the clustering results. This resource has been used for the SemEval-2020 task on unsupervised lexical semantic change detection, but its possibilities are much broader, ranging from the application of different clustering techniques, including soft-clustering, to use as ground truth for diachronic word sense disambiguation or temporal classification of sentences. The corpora used and some aspects of the annotation procedure were different for Latin; this was a necessary choice due to the lack of native speakers of this language and to the nature of the texts at our disposal. Offering a resource for Latin attests to the methodological and intellectual contribution of our work, and we believe in the value of working on lexical semantic change for a historical language.
Future work entails annotating additional critical edges to allow for a better understanding of robustness: how much annotation is needed for different kinds of words? Knowing that some words, e.g., single-sense concrete words, require less annotation allows us to spend more effort on abstract and highly polysemous words. We will also analyze the influence of edge sparsity and ambiguity on the clustering procedure and compare its output to that of other annotation strategies.

A Annotation pipeline example
Figure 8 shows an example of our annotation pipeline. As the annotation proceeds through the rounds, the graph becomes more populated and the true cluster structure is found. In round 1, one multi-cluster is found. Hence, all remaining usages are compared with this cluster in round 2 by the combination step. In rounds 3 and 4, the exploration step discovers more clusters not found in the previous rounds.

Figure 1 :
Figure 1: Usage-usage graphs of English plane (left), German ausspannen (middle) and Swedish ledning (right). Nodes represent usages of the respective target word. Edge weights represent the median of relatedness judgments between usages (black/gray lines for high/low edge weights, i.e., weights ≥ 2.5 / weights < 2.5).

Figure 2 :
Figure 2: Usage-usage graph of Swedish ledning (left), subgraph for the first time period C1 (middle) and the second time period C2 (right).
This way of clustering usage graphs has several advantages: (i) It finds the optimal number of clusters on its own. (ii) It easily handles missing information (non-observed edges). (iii) It is robust to errors by using the global information in the graph; that is, one wrong judgment can be outweighed by correct ones. (iv) It directly optimizes an intuitive quality criterion on usage graphs. Many other clustering algorithms, such as Chinese Whispers (Biemann, 2006), make local decisions, so that the final solution is not guaranteed to optimize a global criterion such as L. (v) By weighing each edge with its (shifted) weight, L respects the gradedness of word meaning; that is, edges with |W'(e)| ≈ 0 have less influence on L than edges with |W'(e)| ≈ 1.5. The clustered graphs are provided with the published data. Figure 2 shows the annotated and clustered UUG G for SV ledning (left). Nodes represent usages of the target word (isolates removed). Edges represent the median of relatedness judgments between usages. Colors mark clusters (senses) inferred on the full graph G. G1 (middle) and G2 (right) represent the time-specific subgraphs resulting from removing the respective nodes and their edges for each time period (C1, C2) from the full graph.
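The loss criterion L with shifted weights can be sketched as follows. This is a minimal illustration of the criterion, not the released clustering implementation, and the toy edges are our own.

```python
def clustering_loss(edges, assignment, shift=2.5):
    """Sketch of the correlation-clustering loss L: edges are
    (u1, u2, weight) triples with weights on the 1..4 DURel scale;
    assignment maps each node to a cluster id. Weights are shifted by
    2.5; positive edges across clusters and (absolute) negative edges
    within clusters count as disagreements."""
    loss = 0.0
    for u1, u2, w in edges:
        w_shifted = w - shift
        same_cluster = assignment[u1] == assignment[u2]
        if same_cluster and w_shifted < 0:        # negative edge inside a cluster
            loss += abs(w_shifted)
        elif not same_cluster and w_shifted > 0:  # positive edge between clusters
            loss += w_shifted
    return loss

edges = [("a", "b", 4), ("b", "c", 1), ("a", "c", 1)]
print(clustering_loss(edges, {"a": 0, "b": 0, "c": 1}))  # -> 0.0
print(clustering_loss(edges, {"a": 0, "b": 1, "c": 1}))  # 1.5 + 1.5 -> 3.0
```

The first assignment separates the unrelated usage c and incurs no disagreement; the second splits the related pair (a, b) and merges an unrelated pair, so both penalties apply. An optimizer such as Simulated Annealing searches over assignments for the minimum of this loss.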

Figure 4 :
Figure 4: Usage-sense graph of Latin sacramentum (left), subgraph for the first time period C1 (middle) and the second time period C2 (right).

Figure 6 :
Figure 6: Disagreement frequency on edges with two annotations. Numbers in the legend correspond to disagreements by points on the DURel scale.

Figure 7 :
Figure 7: Mean cluster accuracies and CIs (y-axis) for increasing proportions of random annotations (x-axis).

Figure 8 :
Figure 8: Simulated example of annotation pipeline.

References

Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169-174, New Orleans, Louisiana.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123.

A. Soares da Silva. 1992. Homonímia e polissemia: Análise sémica e teoria do campo léxico. In Actas do XIX Congreso Internacional de Lingüística e Filoloxía Románicas, volume 2 of Lexicoloxía e Metalexicografía, pages 257-287, La Coruña. Fundación Pedro Barrié de la Maza.

Språkbanken. Downloaded in 2019. The Kubhist Corpus, v2. Department of Swedish, University of Gothenburg.

Svenska Akademien. 2009. Contemporary dictionary of the Swedish Academy. The changed words are extracted from a database managed by the research group that develops the Contemporary dictionary.

Nina Tahmasebi and Thomas Risse. 2017. Finding individual word sense changes and their delay in appearance. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 741-749, Varna, Bulgaria.

Warren Weaver. 1949/1955. Translation. In William N. Locke and A. Donald Boothe, editors, Machine Translation of Languages, pages 15-23. MIT Press, Cambridge, MA. Reprinted from a memorandum written by Weaver in 1949.

Yorick Wilks and Edward L. Keenan. 1975. Preference Semantics, pages 329-348. Cambridge University Press.