Improving Diachronic Word Sense Induction with a Nonparametric Bayesian method

Diachronic Word Sense Induction (DWSI) is the task of inducing the temporal representations of a word meaning from its contexts, as a set of senses and their prevalence over time. We introduce two new models for DWSI, based on topic modelling techniques: one is based on Hierarchical Dirichlet Processes (HDP), a nonparametric model; the other is based on the Dynamic Embedded Topic Model (DETM), a recent dynamic neural model. We evaluate these models against two state-of-the-art DWSI models, using a time-stamped labelled dataset from the biomedical domain. We demonstrate that the two proposed models perform better than the state of the art. In particular, the HDP-based model drastically outperforms all the other models, including the dynamic neural models.


Introduction
Word meanings evolve over time. Recent research has focused on how to model such dynamic behaviour. The unsupervised task of Diachronic Word Sense Induction (DWSI) aims to capture how the meaning of a word varies continuously over time, in particular when new senses appear or old senses disappear. DWSI takes the time dimension into account and assumes that the data spans a long continuous period of time, in order to model the progressive evolution of senses across time.
The dynamic behaviour of words contributes to semantic ambiguity, which is a challenge in many NLP tasks. DWSI can serve as an analytical tool to help build terminology resources and index documents more accurately, and can therefore benefit information retrieval tasks. DWSI follows the probabilistic graphical modelling approach to approximate the true meanings from the observed data. Thus, in this paper, we explore the relation of DWSI to topic modelling in general and to dynamic topic modelling techniques in particular: both aim to discover a latent variable (sense or topic, respectively) from a sequential collection of documents. Despite the close relation between the tasks, topic modelling techniques are not fully explored or compared against in the current state of the art of DWSI.
The state of the art of DWSI consists of only two models: Emms and Kumar Jayapal (2016) and Frermann and Lapata (2016). Both are designed specifically for DWSI; both are parametric; and both are dynamic, in the sense that they introduce a time variable into the model in order to capture the evolution of meaning over time. Emms and Kumar Jayapal (2016) propose a parametric generative model (NEO) where each sense is represented as a |V|-dimensional multinomial distribution over the vocabulary V, each document is represented as a mixture of senses, and the dependency of the sense proportions on time is represented as a K-dimensional multinomial distribution over the K senses. The parameters of the model have finite Dirichlet priors. A more complex model called SCAN (Frermann and Lapata, 2016) allows each sense distribution over the vocabulary, as well as the sense proportions, to evolve sequentially across adjacent time slices. The multinomial parameters of words and senses have logistic normal priors.
The two above-mentioned models are parametric, in the sense that the number of senses (which reflects the structure of the hidden meanings in the data) is a hyper-parameter which has to be known a priori. This is not ideal given the nature of the DWSI task, which is meant to infer senses from the data. The same issue has been studied for the tasks of topic modelling and WSI; Hierarchical Dirichlet Processes (HDP), a nonparametric hierarchical model introduced by Teh et al. (2006), offer a powerful solution to this problem. HDP extends Latent Dirichlet Allocation (LDA) (Blei et al., 2003) by placing Dirichlet process (DP) priors (Ferguson, 1973) on the infinite-dimensional space of multinomial probability distributions. Thus the number of mixture components is infinite a priori and is inferred from the data. In contrast, LDA posits a predefined number K of topics, each of which is a multinomial distribution over the vocabulary. Each document has specific topic proportions drawn from a Dirichlet prior, and the topics are shared among the documents. Additionally, the HDP model allows sharing topics not only among documents but also across hierarchical levels through the use of multiple DPs.
The intuition behind our approach relies on the fact that hierarchical DPs allow "new" senses to appear as needed, thanks to the theoretically infinite number of possible senses. Therefore, the hierarchical design of Dirichlet processes can capture the dynamic behaviour of words while inferring the optimal number of clusters directly from the data across time.
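The "new clusters appear as needed" property of Dirichlet process mixtures can be illustrated with the Chinese Restaurant Process, their predictive distribution. The sketch below is purely illustrative (the function name, seed and concentration value are our own, not from the paper): the number of clusters grows slowly with the data rather than being fixed in advance.

```python
# A minimal simulation of the "rich get richer" Chinese Restaurant Process
# underlying Dirichlet process mixtures: a new cluster (sense) can always
# appear, yet the number of clusters grows slowly with the data.
import random

def crp_partition(n_points, alpha, seed=0):
    """Sample a partition of n_points items under CRP(alpha)."""
    rng = random.Random(seed)
    counts = []  # counts[k] = size of cluster k
    for i in range(n_points):
        # probability of opening a new cluster: alpha / (i + alpha)
        if rng.random() < alpha / (i + alpha):
            counts.append(1)
        else:
            # join an existing cluster with probability proportional to its size
            r = rng.random() * i
            acc = 0
            for k, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[k] += 1
                    break
    return counts

print(crp_partition(1000, alpha=1.0))  # a handful of clusters, not 1000
```

With 1000 points and alpha = 1, the expected number of clusters is roughly log(1000) ≈ 7, which is the behaviour that lets HDP adapt the number of senses to the data.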
Word embeddings are another natural direction of potential improvement for DWSI. Introduced by Rumelhart and Abrahamson (1973) and Bengio et al. (2003, 2006), they provide a distributed representation in which words with similar meanings are close in a lower-dimensional vector space. Recently, various models have been proposed which integrate word embeddings for topic modelling; however, these models do not necessarily represent both words and topics using embeddings. Dieng et al. (2019) provide an elegant solution to this problem: the Dynamic Embedded Topic Model (DETM) is a parametric generative model inspired by D-LDA (Dynamic LDA) (Blei and Lafferty, 2006), in which each word is represented by a word embedding, and per-time topics are represented as embeddings as well. Topics and topic proportions evolve sequentially across adjacent time slices. DETM also directly models the per-topic conditional probability of a word as the exponentiated inner product between the word embeddings and the per-time topic embeddings. This results in a closer semantic correspondence between words and topics, and thus leads to better topic quality.
By contrast to previous contributions in DWSI, which were mostly theoretical, this paper is an empirical contribution focusing on adapting different existing topic modelling techniques to DWSI. The aim is to set the state-of-the-art DWSI models up against two serious competitors, in order to check whether they actually fit the task of DWSI optimally. In this perspective, we adapt HDP and DETM to the task of DWSI, describing our approach in §3. We test the ability of these models to detect meaning change over time using the evaluation framework proposed by Alsulaimani et al. (2020), described in §4: using a large corpus of biomedical time-stamped data, including 188 ambiguous target words, we compare the proposed models with the current state-of-the-art models NEO and SCAN. The results, presented in §5, show that HDP-based models achieve the best results over the dataset, establishing a new state of the art for DWSI.

Related Work
Topic modelling techniques are hierarchical probabilistic Bayesian models used originally for discovering topics in a collection of documents (Blei et al., 2010). Topic models have also been adopted for the Word Sense Induction (WSI) task, as introduced by Brody and Lapata (2009) and Yao and Van Durme (2011): word senses are treated as topics, and a short window around the target word (the context) is considered instead of a full document. Topic modelling techniques have been extended further to similar tasks, such as Novel Sense Detection.
Novel Sense Detection (NSD; also called Novel Sense Identification), introduced by Lau et al. (2012), consists of determining whether a target word acquires a new sense over two independent periods of time, separated by a large gap. Several authors have used Hierarchical Dirichlet Processes (HDP) for this task over a small set of target words and/or a small set of data (Lau et al., 2012, 2014; Cook et al., 2014). Yao and Van Durme (2011) and Lau et al. (2012) show in a preliminary study that HDP is also superior to LDA for WSI, due to its ability to adapt to varying degrees of granularity. Lau et al. (2012) extend this study using an oracle-based method to identify new senses from HDP predictions for the task of NSD, for only five target words. Sarsfield and Tayyar Madabushi (2020) used HDP for NSD on a larger dataset (Schlechtweg et al., 2020), proposed in a recent shared task on Lexical Semantic Change Detection (LSCD), a refined version of NSD: LSCD intends to answer the question of whether the meaning of a target word has changed between two independent periods of time (also separated by a large time gap). In the LSCD task, methods based on static word embeddings (where the meaning of the word is represented by a single vector) achieved the highest performance.
In contrast to NSD/LSCD, DWSI takes the time dimension into account and is thus technically broader: it aims to discriminate senses and also models the temporal dynamics of word meaning across a long continuous period of time, e.g. year by year. As a result, DWSI can track the evolution of senses and the emergence of new senses, and detect the year in which a new sense appears. The DWSI task was introduced independently by Emms and Kumar Jayapal (2016) and Frermann and Lapata (2016); given a target word and a time-stamped corpus, both models estimate two main parameters: the senses as distributions over words, and the sense proportions over time. Frermann and Lapata (2016) extend this by also inferring subtle meaning changes within a single sense over time, i.e. by allowing different word distributions over time for the same sense.
However, these models are parametric and require the number of senses to be chosen in advance. Previous approaches dealt with this issue by increasing the number of senses: Emms and Kumar Jayapal (2016) vary the number of senses manually for every target word, while Frermann and Lapata (2016) choose an arbitrary fixed large number of senses for all the target words.
Additionally, evaluating and comparing such models on the DWSI task is difficult: the lack of large-scale time-stamped and sense-annotated data hinders direct quantitative evaluation. The state-of-the-art models (Emms and Kumar Jayapal, 2016; Frermann and Lapata, 2016) were originally evaluated only qualitatively, on a few hand-picked target words, with a manual investigation of the quality of the associated top words in each cluster; Frermann and Lapata (2016) also evaluated their model on several indirect tasks. Alsulaimani et al. (2020) demonstrate that these evaluation methods are insufficient, and consequently propose a quantitative evaluation of these DWSI models based on a large set of data. In particular, they show that the sense size distribution plays a significant role in capturing the sense representations and the emergence of new senses. The number of senses is clearly a crucial hyperparameter for a DWSI model, the choice of which should in theory depend on the characteristics of the data.

Approach
Parameters and Notation

DWSI aims to discover the senses S across time Y for each target word in a sequential collection of documents, where senses are latent variables and the number of senses is unknown a priori. A DWSI model estimates at least two multinomial distributions:

• P(W|S), the word-given-sense distribution. The changes within senses across time can also be represented as P(W|S,Y), the word-given-sense-and-year distribution. These distributions represent the sense.

• P(S|Y), the sense-given-year distribution. This distribution represents the relative prevalence of a sense over time.

HDP-DWSI
HDP allows senses (i.e. clusters) to appear when a new context occurs, as the number of senses is determined by the data. HDP-DWSI directly relies on this property: in the first step, all the documents, independently of their year, are clustered by HDP (Appendix A provides a description of HDP). This means that in this step the documents are assumed to be exchangeable, as opposed to dynamic models in which documents are only exchangeable within a time period. In the second step, the year of each document (an observed variable) is reintroduced and the time-related multinomial parameters P(S = s | Y = y) are estimated by marginalising across the documents of each year independently:

P(S = s | Y = y) = Σ_{d ∈ y} freq(s_d) / Σ_{s'} Σ_{d ∈ y} freq(s'_d)

where freq(s_d) is the number of words predicted as sense s in document d, and d ∈ y represents the condition that document d belongs to year y.
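The second-step estimation can be sketched as follows. The per-document sense counts and years are invented, and the normalisation over year totals is our reading of the marginalisation step:

```python
# Estimating P(S=s|Y=y) from per-document counts freq(s_d) of words
# assigned to each predicted sense: sum over the documents of year y,
# then normalise within the year.
from collections import Counter

# doc -> (year, Counter of words-per-predicted-sense) -- toy values
predictions = [
    (2003, Counter({"s0": 8, "s1": 2})),
    (2003, Counter({"s0": 5, "s1": 5})),
    (2004, Counter({"s1": 9})),
]

def sense_given_year(predictions):
    totals = {}
    for year, freqs in predictions:
        totals.setdefault(year, Counter()).update(freqs)
    return {
        y: {s: c / sum(cnt.values()) for s, c in cnt.items()}
        for y, cnt in totals.items()
    }

print(sense_given_year(predictions)[2003])  # {'s0': 0.65, 's1': 0.35}
```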
HDP-DWSI is intended to be used as a nonparametric method, but a parametric mode is also proposed for the purpose of evaluation and comparison against parametric models. In the nonparametric mode, the model parameters are obtained directly as described above. In the parametric mode, an additional step is required to reduce the number of senses, because HDP-DWSI tends to induce a higher number of clusters than the gold number of senses, i.e. to split senses into multiple clusters.
Depending on the application context, it can also be relevant to reduce the number of senses in the nonparametric mode. This can be done with the method described below for the parametric mode, called HDP-DWSI m.
HDP-DWSI m consists in merging the predicted senses which are the most semantically similar. Agglomerative hierarchical clustering (Ward Jr, 1963) is used to merge senses, based on a sense cooccurrence matrix obtained from the HDP clustering output.
Pointwise Mutual Information (PMI) is used to represent how strongly two predicted senses are statistically associated, relative to the assumption of independence:

PMI(s_i, s_j) = log( P(s_i, s_j) / (P(s_i) P(s_j)) ),  i ≠ j    (1)

where P(s_i, s_j) is the joint probability of observing both s_i and s_j in the same document, and P(s_i) (resp. P(s_j)) is the probability of a predicted sense with respect to the entire corpus, i.e. an occurrence is counted for every document in which the predicted sense s_i (resp. s_j) independently occurs.
Moreover, since a pair of predicted senses with a negative PMI is uninformative for the purpose of merging similar senses, Positive Pointwise Mutual Information (PPMI), as defined in Equation 2, is used for constructing the sense cooccurrence matrix:

PPMI(s_i, s_j) = max(PMI(s_i, s_j), 0)    (2)
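The document-level PPMI computation can be sketched as follows; the per-document sense assignments are invented, and probabilities are simple document-occurrence frequencies:

```python
# PPMI between two predicted senses over documents: PMI clipped at zero,
# with probabilities estimated as document-occurrence frequencies.
import math

# which predicted senses occur in each document (toy assignments)
doc_senses = [{"s0", "s1"}, {"s0", "s1"}, {"s0"}, {"s2"}]

def ppmi(si, sj, doc_senses):
    n = len(doc_senses)
    p_i = sum(si in d for d in doc_senses) / n
    p_j = sum(sj in d for d in doc_senses) / n
    p_ij = sum(si in d and sj in d for d in doc_senses) / n
    if p_ij == 0:
        return 0.0
    return max(0.0, math.log(p_ij / (p_i * p_j)))

print(round(ppmi("s0", "s1", doc_senses), 3))  # log(4/3) ~ 0.288
```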
(P)PMI is sensitive to low-frequency events, particularly when one (or both) of the predicted senses is infrequent with respect to the whole corpus; two senses may then mostly cooccur by chance, yet obtain a high (P)PMI value. In such a case, the two predicted senses are not semantically associated, so this is a potential bias in the merging process.
To counter this bias, we use the linkage criterion defined in Equation 3: the average of the PPMI values weighted by their corresponding joint probabilities. For two clusters C_1, C_2:

linkage(C_1, C_2) = Σ_{s_1 ∈ C_1, s_2 ∈ C_2} w(s_1, s_2) · PPMI(s_1, s_2) / Σ_{s_1 ∈ C_1, s_2 ∈ C_2} w(s_1, s_2)    (3)

where w(s_1, s_2) = P(s_1, s_2).

The evaluation method proposed by Alsulaimani et al. (2020) (see §4) relies on the gold number of senses, as it is originally intended for parametric methods. In order to compare an HDP-based model against parametric models in an equivalent setting, the HDP-DWSI m merging method is used to reduce the predicted number of senses to the gold-standard number of senses.
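A hedged sketch of the merging step: greedy agglomerative clustering of predicted senses, with cluster similarity defined as the PPMI values averaged with weights w(s1, s2) = P(s1, s2). The toy PPMI and joint-probability matrices below are invented, and the greedy loop is a simplification of full agglomerative clustering:

```python
# Greedy agglomerative merging of predicted senses down to k clusters,
# using joint-probability-weighted average PPMI as the linkage criterion.
def linkage(c1, c2, ppmi, joint):
    pairs = [(s1, s2) for s1 in c1 for s2 in c2]
    wsum = sum(joint[s1][s2] for s1, s2 in pairs)
    if wsum == 0:
        return 0.0
    return sum(joint[s1][s2] * ppmi[s1][s2] for s1, s2 in pairs) / wsum

def merge_to_k(senses, ppmi, joint, k):
    clusters = [[s] for s in senses]
    while len(clusters) > k:
        # merge the most similar pair of clusters (highest linkage)
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]], ppmi, joint),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# toy symmetric matrices: s0 and s1 are strongly associated
ppmi_m = {"s0": {"s1": 1.0, "s2": 0.0}, "s1": {"s0": 1.0, "s2": 0.1},
          "s2": {"s0": 0.0, "s1": 0.1}}
joint_m = {"s0": {"s1": 0.50, "s2": 0.01}, "s1": {"s0": 0.50, "s2": 0.05},
           "s2": {"s0": 0.01, "s1": 0.05}}
print(merge_to_k(["s0", "s1", "s2"], ppmi_m, joint_m, 2))  # [['s0', 's1'], ['s2']]
```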

DETM-DWSI
DETM represents not only the observed words but also the latent topics/senses as embeddings, while preserving the traditional representation of a topic/sense as a probability distribution over words. The categorical distributions over the vocabulary are time-dependent, i.e. P(W|S,Y), and are derived from the corresponding word embeddings and the sense embedding at a given time. DETM also places time-dependent priors over the sense proportions: the use of a Markov chain over the sense proportions enforces smooth variation between adjacent senses at neighbouring times (see Appendix A for a description of DETM). We propose two modes for DETM-DWSI:
• In the regular DETM-DWSI, the word and sense embeddings are trained simultaneously. This mode does not require any additional resources, but the corpus must be large enough for the embeddings to be accurate.
• In DETM-DWSI i , the model is trained with prefitted word embeddings.This mode leverages the external information contained in the embeddings, potentially obtaining a more accurate representation of the senses as a consequence.It also allows the application of the model to text containing words not present in the corpus, as long as their embedding is available.
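The core DETM parameterisation, the per-time per-sense word distribution as a softmax over inner products between word embeddings and a time-specific sense embedding, can be sketched as follows. The random vectors, dimensions and variable names (rho, alpha_kt) are illustrative, not DETM's trained values:

```python
# DETM-style word distribution: P(W | S=k, Y=t) is the softmax of the
# inner products between word embeddings rho and the sense embedding
# alpha_kt of sense k at time t.
import numpy as np

rng = np.random.default_rng(0)
V, L = 1000, 200                 # vocabulary size, embedding dimension
rho = rng.normal(size=(V, L))    # word embeddings (one row per word)
alpha_kt = rng.normal(size=L)    # embedding of sense k at time t

logits = rho @ alpha_kt
beta_kt = np.exp(logits - logits.max())  # subtract max for stability
beta_kt /= beta_kt.sum()                 # a point on the vocabulary simplex
print(beta_kt.shape, float(beta_kt.sum()))
```

Words whose embeddings point in the same direction as the sense embedding get high probability, which is what produces the closer semantic correspondence between words and senses mentioned above.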
In the experiments described below, the DETM-DWSI i models are trained using the BioWordVec pretrained word embeddings (Zhang et al., 2019). The fastText subword embedding model (Bojanowski et al., 2017) is a variant of the continuous skip-gram model (Mikolov et al., 2013); it learns a distinct vector for each word while exploiting subword information in a unified n-gram embedding space. BioWordVec embeddings are trained with fastText on PubMed text and MeSH terms, combined into a unified embedding space. In the biomedical domain, the advantage of a subword embedding model is that it can handle Out-of-Vocabulary (OOV) words (Zhang et al., 2019). This leads to a more precise word representation, in theory better able to capture the semantics of specialised concepts. We use the intrinsic BioWordVec embeddings (as opposed to the extrinsic type), meant to represent the semantic similarity between words (Zhang et al., 2019).
Experimental Setup

Data
We use the DWSI evaluation framework proposed by Alsulaimani et al. (2020): the biomedical literature is used as a source of labelled and time-stamped data covering the years 1946 to 2019. The dataset is collected from resources provided by the US National Library of Medicine (NLM): PubMed (a platform which includes the major biomedical literature databases) and MeSH (a controlled vocabulary thesaurus, created manually to index NLM databases). The data is preprocessed as in Alsulaimani et al. (2020). The data consists of 188 ambiguous target words and 379 gold-standard senses (Jimeno-Yepes et al., 2011): 75 ambiguous target words have 2 senses, 12 have 3 and one has 5 senses. The total data size is 15.36 × 10^9 words, and the average number of documents per sense is 61,352. The input documents for every target word consist of the occurrences of the target word, each provided with a window of 5 words of context on each side as well as the year of publication. The gold-standard sense label is also available for evaluation purposes.

Algorithms Settings
• The HDP-DWSI and HDP-DWSI m models are trained using the official C++ implementation of HDP. No additional preprocessing is needed.
• The DETM-DWSI and DETM-DWSI i models are trained using the implementation provided by Dieng et al. (2019). The preprocessing is adapted to the DWSI dataset: since the data is strongly imbalanced across time, stratified sampling is used in order to ensure a representative time distribution (with at least one instance per year) across the data partitions.
The data is split into 85% of instances for training and 15% for validation. The document frequency thresholds are unused, so as to include all the words. For efficiency reasons, the number of training instances is capped at 2,000 per year.

Evaluation Methodology
Since DWSI is an unsupervised task (clustering) and our evaluation is based on the external sense labels, both the estimation of the model and the evaluation are performed on the full set of documents for each target word. The gold-standard number of senses of each ambiguous target word is provided to all the parametric models (excluding HDP-DWSI). The default parameters are used in all the systems, except the number of iterations/epochs (set to 500 for all the systems); for DETM-DWSI specifically, the batch size is set to 1000 and the dimension of the embeddings to 200.
After estimating each model for each ambiguous target word, the posterior probability is calculated for every document, and the sense with the highest probability is assigned.

Evaluation Measures
We follow Alsulaimani et al. (2020) for the evaluation measures with some adjustments, detailed below.
The "Global Matching" method, presented by Alsulaimani et al. (2020), consists in determining a one-to-one assignment between predicted senses and gold senses based on their joint frequency: the pair with the highest frequency is matched first, and this process is iterated until all the senses are matched. In the case of HDP-DWSI, the number of predicted senses may be higher than the gold number of senses, and the instances of the predicted senses which remain unmatched are counted as false negatives. This allows comparing HDP-DWSI with the parametric models, assuming that in theory the ideal nonparametric model would infer exactly the true number of senses. Of course, HDP-DWSI m is by definition more appropriate for a comparison of HDP-based methods in the parametric setting.
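The greedy one-to-one assignment described above can be sketched on a toy contingency table (the cluster/gold names and counts are invented):

```python
# "Global Matching" sketch: repeatedly pair the predicted/gold sense
# combination with the highest joint frequency, removing both from the
# pool, until one side is exhausted.
def global_matching(table):
    """table[p][g] = number of instances with predicted sense p and gold sense g."""
    preds = set(table)
    golds = {g for row in table.values() for g in row}
    matching = {}
    while preds and golds:
        p, g = max(
            ((p, g) for p in preds for g in golds),
            key=lambda pg: table[pg[0]].get(pg[1], 0),
        )
        matching[p] = g
        preds.remove(p)
        golds.remove(g)
    return matching  # instances of unmatched predicted senses count as false negatives

table = {
    "c0": {"g0": 40, "g1": 2},
    "c1": {"g0": 5, "g1": 30},
    "c2": {"g0": 3, "g1": 4},   # stays unmatched: more clusters than gold senses
}
print(global_matching(table))   # {'c0': 'g0', 'c1': 'g1'}
```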
We also propose the V-measure as an additional evaluation method. Introduced by Rosenberg and Hirschberg (2007), the V-measure evaluates every cluster against every gold sense without relying on a matching method, thus providing an objective assessment even when the number of clusters is higher than the true number of senses. The V-measure is based on entropy (a measure of the uncertainty associated with a random variable): it is defined as the harmonic mean of homogeneity and completeness, both of which are based on normalised conditional entropy.
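A self-contained computation of homogeneity, completeness and V-measure from entropies, following the Rosenberg and Hirschberg (2007) definitions; the label sequences are toy examples:

```python
# V-measure = harmonic mean of homogeneity and completeness, both defined
# from normalised conditional entropies of gold vs. predicted labels.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    """H(a | b)."""
    n = len(a)
    h = 0.0
    for bv, nb in Counter(b).items():
        sub = [av for av, bv2 in zip(a, b) if bv2 == bv]
        h += nb / n * entropy(sub)
    return h

def v_measure(gold, pred):
    h = 1.0 if entropy(gold) == 0 else 1 - cond_entropy(gold, pred) / entropy(gold)
    c = 1.0 if entropy(pred) == 0 else 1 - cond_entropy(pred, gold) / entropy(pred)
    return 2 * h * c / (h + c) if h + c else 0.0

gold = ["a", "a", "b", "b"]
print(v_measure(gold, ["x", "x", "y", "y"]))  # 1.0: a perfect relabelling
print(v_measure(gold, ["x", "x", "x", "x"]))  # 0.0: one big cluster
```

Note how a perfect relabelling scores 1.0 without any matching step, which is exactly why the V-measure remains fair when a system predicts more clusters than gold senses.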
Alsulaimani et al. (2020) also propose to evaluate the emergence of a new sense by considering whether the system predicts the true emergence year of a sense. This requires a method to determine the year from the P(S|Y) distribution, for which the original algorithm "EmergeTime" was proposed in Jayapal (2017). We introduce "LREmergeTime" (see Appendix B, Algorithm 1), an improved version of "EmergeTime" using linear regression instead of multiple thresholds within a window. Indeed, the original algorithm is very sensitive to the noise which sometimes occurs in the emergence pattern; linear regression handles this issue better, since it measures the global trend across the window. The emergence year is evaluated as in Alsulaimani et al. (2020): (1) with standard classification measures, considering the sense as correctly predicted if the year is within 5 years of the true emergence year; (2) with (normalised) Mean Absolute Error, representing the average difference in number of years but also penalising wrongly predicted presence/absence of emergence.
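The idea behind "LREmergeTime" can be loosely sketched as follows; the real algorithm is specified in the paper's Appendix B, and the window size, slope threshold and near-zero start threshold below are invented parameters:

```python
# A loose sketch of linear-regression-based emergence detection: slide a
# window over P(s|y), fit a least-squares line, and report the start of
# the first window that shows a clear upward trend from a near-zero level.
def emergence_year(years, probs, window=5, slope_min=0.02, start_max=0.05):
    for i in range(len(years) - window + 1):
        xs = years[i:i + window]
        ys = probs[i:i + window]
        # least-squares slope of ys against xs
        mx, my = sum(xs) / window, sum(ys) / window
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
                sum((x - mx) ** 2 for x in xs)
        if ys[0] <= start_max and slope >= slope_min:
            return xs[0]
    return None  # no emergence detected

years = list(range(2000, 2012))
probs = [0.0] * 6 + [0.05, 0.15, 0.3, 0.45, 0.5, 0.55]
print(emergence_year(years, probs))
```

Because the slope summarises the whole window, a single noisy year does not trigger a false emergence, which is the robustness argument made above against the threshold-based "EmergeTime".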
Finally we also use the distance between the true and predicted evolution of the senses over time (P (S|Y )) as an evaluation method for DWSI, again following Alsulaimani et al. (2020).

Qualitative exploration
As an example, we explore the temporal meanings of "SARS-associated coronavirus" over the years 2002-2018. The ambiguous word has two gold-standard senses, described by the UMLS concepts C1175175 and C1175743: Severe Acute Respiratory Syndrome (which refers to the disease caused by the virus) and SARS Virus (which refers to the virus, related to the Coronavirus family, causing the disease), respectively. The top words under the inferred word-given-sense parameter identified by HDP-DWSI m are, for the first sense, {patients, outbreak, sars, 2003, epidemic, health, case, transmission, hospital} and, for the second sense, {cov, sars, coronavirus, patients, infection, protein, respiratory, acute, syndrome, cells}. Figure 1 shows the relative prevalence of the two inferred and gold senses over time, and Table 1 shows the top inferred words/usages associated with sense C1175175 at specific times.

Matching-based Evaluation
Table 2 shows the performance of the six models according to standard classification and regression measures using "Global Matching". In general, DWSI models based on HDP perform well compared to NEO or SCAN. In the case of HDP-DWSI, "Global Matching" causes two observable effects: it increases precision, by allowing the system to choose the best predicted clusters matched with the gold senses; but it also decreases recall, by introducing a large number of false negative cases due to the discarded unmatched predicted clusters. Nevertheless, the macro F1 score for HDP-DWSI is much higher than both NEO and SCAN, by 17.7% and 13.8% respectively. This shows that HDP-DWSI can distinguish minority senses significantly better.
The superiority of HDP-DWSI m is even clearer: its macro F1 score is 20.8% higher than NEO and 16.8% higher than SCAN; the performance difference in micro F1 score is even stronger: 21.0% above DETM-DWSI i, 17.4% above DETM-DWSI, 25.0% above NEO and 33.3% above SCAN. Contrary to the differences between NEO and SCAN, HDP-DWSI m improves performance significantly across the board: both precision and recall are drastically higher, according to both micro and macro scores. This means that HDP-based models are fundamentally much better at discriminating the different senses (with a very significant p-value < 0.05), as opposed to strategically favouring large senses, for instance. This is confirmed in Table 3.
The two DETM-based models perform very well, in particular achieving micro F1 scores much higher than NEO and SCAN. However, their macro-average performance is comparable to NEO and SCAN, a clear sign that they do not separate the senses better.
Table 4: V-measure, homogeneity and completeness for all the systems. Both the mean and the median across targets are reported, because the strong differences between targets in terms of size and distribution of the senses may bias the mean.
Table 4 shows the results of the systems for the V-measure, with details about homogeneity and completeness. HDP-DWSI and HDP-DWSI m perform best at all three levels, with values far above the other systems. HDP-DWSI has the highest mean homogeneity, because this model produces a higher number of smaller predicted senses; these predicted senses are therefore more homogeneous in general, but also less complete, since the gold senses are often split. HDP-DWSI m merges the senses predicted by HDP-DWSI, thus obtaining lower homogeneity but compensating with higher completeness, leading to a higher mean V-measure.
Figure 2 offers a more precise picture of the differences between systems in terms of their V-measure distribution. It confirms that DETM-DWSI, DETM-DWSI i and SCAN perform very similarly. It also shows that the higher performance of DETM-DWSI, DETM-DWSI i and SCAN compared to NEO is due to a minority of targets, as their 75% lowest scores are almost identical. These targets cause most of the large difference in mean between NEO and SCAN, as the smaller difference in medians shows.
By contrast, HDP-DWSI and HDP-DWSI m have a much smaller proportion of low scores. Interestingly, HDP-DWSI has higher low scores than HDP-DWSI m, i.e. HDP-DWSI performs better until both systems reach the median. However, HDP-DWSI m skyrockets just after the median and surpasses HDP-DWSI by having much higher high scores. This explains why the median is slightly lower for HDP-DWSI m than for HDP-DWSI while the mean is much higher for HDP-DWSI m.
The V-measure can introduce a bias towards systems which predict a number of clusters larger than the number of gold senses: such systems tend to have very high homogeneity scores and low completeness scores. However, this is not the case for HDP-DWSI: its performance is high not only according to the V-measure but is also confirmed by the F1 scores. The number of senses predicted by HDP-DWSI is 8 on average, with a minimum of 4 and a maximum of 13. The Pearson correlation between homogeneity and completeness is 0.853, with a very significant p-value (2.2e-16). Also, there is virtually no correlation between the predicted number of senses and either the size of the data or the V-measure by target: 0.065 and 0.008 respectively (non-significant: p-values 0.3746 and 0.261). This indicates that HDP-DWSI is not biased towards generating more senses when the data is larger.

Comparison between Measures
Table 5 shows that all the evaluation measures are significantly correlated. The macro F1 scores are positively correlated in all four systems. However, the micro F-score favours systems that perform well on the majority sense, whereas the V-measure explicitly evaluates every cluster, taking into account not only the majority sense but also the minority ones. Therefore, systems which favour the majority sense, like NEO and DETM-DWSI i, have a lower correlation.
DWSI systems can also be evaluated on their ability to predict the year of emergence of a new sense. Table 6 shows the performance of the systems after applying "LREmergeTime" (see §4.4) to their predictions. HDP-DWSI m and NEO perform close to each other and much better than the other systems, according to both the classification measures and MAE. NEO was designed and implemented with a focus on detecting sense emergence, which probably explains why it performs particularly well on this task (Jayapal, 2017).

Evaluation based on the predicted evolution over time
Table 7 shows, for every system, how well its prediction of P(S|Y) matches the true evolution of senses. Among all the systems, HDP-DWSI m predicts the P(S|Y) closest to the true evolution according to both distance measures. This confirms that HDP-DWSI m not only produces accurate predictions of the emergence year of novel senses, but also accurately predicts the P(S|Y) trends in general, with significantly fewer errors than the other systems.

Conclusion and Discussion
In this paper we adapted two topic modelling methods to the task of DWSI and evaluated them against two state-of-the-art DWSI systems, NEO and SCAN, using the evaluation framework proposed by Alsulaimani et al. (2020). We also compared the systems using the V-measure, and proposed an improved version of the emergence detection algorithm.
The results show that HDP-based models are able to fit the data better than the parametric models.
The results strongly show that merging HDP-DWSI clusters performs better than the DETM-DWSI models and LDA-like clustering such as NEO and SCAN. The properties of HDP make it better at accurately fitting the topics/senses, in particular when there is a high imbalance between the sense proportions, i.e. with senses smaller in size (see Table 3). Furthermore, the fact that HDP-DWSI m outperforms all the other parametric models also demonstrates that these models do not find the optimal separation between the senses. It seems that the additional complexity of the time dimension, together with the parametric constraints, does not cope well with data imbalance across years.
One could naturally assume that models designed specifically for a task would perform better on it.
Implicitly, the research community encourages the creation of new models and tends to reward theoretical contributions over empirical ones. Thus there might be a bias in favour of designing sophisticated ad-hoc models (like NEO and SCAN) rather than adapting existing robust models (like HDP).

Biomedical Domain
The dataset used in these experiments belongs to the biomedical domain and is in English.
There is no clear reason why the comparison between models would lead to different results on different domains, therefore we would expect the reported results (at least the major tendencies) to be also valid on the general domain.
Nevertheless, this assumption would need to be tested experimentally. To our knowledge, there is no equivalent dataset available in the general domain which satisfies the two following conditions:
• time-stamped documents spanning a relatively long period of time;
• every document labelled with the sense of the target word.

Duration of the Training Stage
In the table below, we present the computational cost of training the different models presented in this paper. Most of the experiments were carried out on a computing cluster containing 20 to 30 machines with varying characteristics, thus the total duration is approximate.
Computing times are reported in hours of CPU/GPU activity required to train the total of 188 target datasets. It is important to note that the two DETM models are trained on GPUs, whereas all the other models are trained on regular CPUs. Thus, in overall computing power, the DETM models are the most costly to train (more than HDP, despite the higher duration).
• Draw initial sense proportion mean η_0 ∼ N(0, I)
• For time step t = 1, ..., T:
• For each document d:
  - Draw sense proportions θ_d ∼ LN(η_{t_d}, a²I)
  - For each word n in the document d:

• ET: "EmergeTime" emergence year;
• FYO: indicates the "First Year Occurrence" of a sense, determined by the start date of each sense in the data;
• MS: indicates the "Manual Surge", i.e. the visual manual annotations by the authors.
The value "NA" indicates cases where no emergence was found and "?" indicates visually ambiguous cases found during the manual annotation by the authors.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Section 4 and 5
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Not applicable. The data used in this research is secondary data which was previously published. The data source files were taken from the NLM and consist of biomedical scientific publications.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Section 7
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. Section 4
C. Did you run computational experiments? Section 4 and 5
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Section 4 and 7
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? Not applicable. Left blank.

D4. Was the data collection protocol approved (or determined exempt) by an ethics review board? Not applicable. Left blank.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? Not applicable. Left blank.

Figure 1: Dynamic representations of "SARS-associated coronavirus". On the Y-axis, P(S|Y) shows the relative prevalence of the gold senses as well as the predicted senses across time, estimated by HDP-DWSIm.

Figure 2: Quantile plot of the V-measure scores by system, with the quantile rank shown on the X axis and the corresponding value on the Y axis. Example: for HDP-DWSI, the median (x=0.5) is y=0.16. The graph is obtained by sorting the values, then normalising their rank between 0 and 1.
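The sort-then-normalise construction described in the caption can be sketched in a few lines (a minimal illustration; the scores below are made-up values, with 0.16 placed at the median to echo the HDP-DWSI example):

```python
import numpy as np

def quantile_plot_points(scores):
    """Return (quantile_rank, value) pairs for a quantile plot:
    sort the scores, then normalise each rank into [0, 1]."""
    values = np.sort(np.asarray(scores, dtype=float))
    n = len(values)
    # rank i (0-based) maps to quantile i / (n - 1), so the median
    # of an odd-length sample lands exactly at x = 0.5
    ranks = np.arange(n) / (n - 1)
    return ranks, values

ranks, values = quantile_plot_points([0.10, 0.16, 0.05, 0.30, 0.22])
print(ranks[2], values[2])  # the middle point is the median: 0.5 0.16
```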
Pearson correlation coefficients: the relationship between the performance according to different measures. All the results are significantly correlated with p-value <= 5.6e-13. The systems are referred to by their initials.
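As a minimal illustration of how such a coefficient is computed between two evaluation measures (the per-target scores below are invented for the example; the paper's actual values are those reported in the table):

```python
import numpy as np

# hypothetical per-target scores under two evaluation measures
v_measure = np.array([0.16, 0.25, 0.08, 0.40, 0.31])
f1_score  = np.array([0.20, 0.28, 0.10, 0.45, 0.33])

# Pearson correlation coefficient between the two measures:
# off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(v_measure, f1_score)[0, 1]
print(round(r, 3))
```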

Figure 3: Left: graphical representation of HDP for DWSI. Observed variables are represented by shaded nodes and latent variables by clear nodes. Right: the corresponding generative process. Note that in DWSI the sense-related variables replace the topic-related variables.

Figure 4: Left: graphical representation of DETM for DWSI. Observed variables are represented by shaded nodes and latent variables by clear nodes. Right: the corresponding generative process. Note that in DWSI the sense-related variables replace the topic-related variables.
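A toy sketch of the logistic-normal step θ_d ∼ LN(η_{t_d}, a²I) in DETM's generative process — a Gaussian draw pushed through a softmax — assuming a simple Gaussian random walk for the means η_t (the dimensions, seed, and scale a below are arbitrary choices for illustration, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, a = 3, 5, 0.1  # number of senses, time steps, transition scale

# random-walk prior over sense-proportion means:
# eta_0 ~ N(0, I), then eta_t ~ N(eta_{t-1}, a^2 I)
eta = np.zeros((T, K))
eta[0] = rng.standard_normal(K)
for t in range(1, T):
    eta[t] = eta[t - 1] + a * rng.standard_normal(K)

def logistic_normal(mean, scale, rng):
    """Draw theta ~ LN(mean, scale^2 I): a Gaussian draw mapped
    through a softmax, yielding a point on the simplex."""
    z = mean + scale * rng.standard_normal(len(mean))
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# sense proportions for a document time-stamped at step t = 2
theta_d = logistic_normal(eta[2], a, rng)
print(theta_d.sum())  # softmax guarantees a valid distribution: sums to 1
```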

A3. Do the abstract and introduction summarize the paper's main claims? Abstract and Section 1
A4. Have you used AI writing assistants when working on this paper? Left blank.
B. Did you use or create scientific artifacts? Section 1, 4 and 5
B1. Did you cite the creators of artifacts you used? Section 4 and 5
B2. Did you discuss the license or terms for use and / or distribution of any artifacts? Section 4

Table 1: Temporal evolution of the top-7 words for the sense Severe Acute Respiratory Syndrome learned by HDP-DWSIm, at specific times. Both senses' data start in 2002; however, the prevalence of sense C1175175 decreased progressively from 2002 to 2018, since SARS was successfully contained in 2004, while the prevalence of sense C1175743 kept increasing, as research about the SARS virus became a public-health priority around the world. The temporal changes of the top words within C1175175 are highlighted in Table 1. Historically, the first known case of SARS appeared in November 2002, causing the 2002-2004 SARS outbreaks in cities and hospitals. Global attention then started and, in 2016 for instance, the top words shifted to facemask, post, era, sars. Finally, the year 2018 shows concerns about a second wave of SARS.
Table 3: Mean F1-score by sense size. The table confirms that the DETM-based models perform close to NEO and SCAN.

Table 6: Sense emergence evaluation results for all the systems. The values in bold indicate the best score achieved among the systems.

Table 7: Mean distance between the true and predicted sense, measured by Dynamic Time Warping (DTW) and Euclidean distance (lower is better). The results in bold indicate the best system.
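A minimal sketch of the two distances on toy prevalence curves (the curves and the plain O(nm) DTW with an absolute-difference local cost are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference
    as the local cost; D[i, j] is the best alignment cost of
    a[:i] against b[:j]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# toy prevalence curves: identical shape, shifted by one time step
true_curve = [0.1, 0.5, 0.9, 0.5, 0.1]
pred_curve = [0.1, 0.1, 0.5, 0.9, 0.5]
# DTW absorbs most of the shift; Euclidean distance penalises it fully
print(dtw_distance(true_curve, pred_curve))
print(euclidean_distance(true_curve, pred_curve))
```

The contrast between the two numbers is the point: the Euclidean distance compares the curves point by point, while DTW warps the time axis, so a well-shaped but slightly delayed prediction is penalised less.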
A1. Did you describe the limitations of your work?
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? Section 5
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? Section 4 and 5
D. Did you use human annotators (e.g., crowdworkers) or research with human participants?
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? Not applicable. Left blank.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)? Not applicable. Left blank.