Predicting Long-Term Citations from Short-Term Linguistic Influence

A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps: first, identify lexical and semantic changes using contextual embeddings and word frequencies; second, aggregate information about these changes into per-document influence scores by estimating a high-dimensional Hawkes process with a low-rank parameter matrix. We show that this measure of linguistic influence is predictive of $\textit{future}$ citations: the estimate of linguistic influence from the two years after a paper's publication is correlated with and predictive of its citation count in the following three years. This is demonstrated using an online evaluation with incremental temporal training/test splits, in comparison with a strong baseline that includes predictors for initial citation counts, topics, and lexical features.


Introduction
The citation count of a paper is a standard, easily measurable proxy for its influence (Cronin, 2005). Researchers have shown that citation count is strongly correlated with the quality of scientific work (e.g., Lawani, 1986) and the recognition that a paper or an author receives (e.g., Inhaber and Przednowek, 1976), and that it figures in policy decisions such as the assessment of scientific performance (e.g., Cronin, 2005). Consequently, citation count is a ubiquitous and important measure of a paper, with whole subfields of research dedicated to its analysis (Bornmann and Daniel, 2008).
However, papers may be cited (or not cited) for many reasons, and citation count alone is insufficient to explain the emergence and the spread of research ideas and trends.

Figure 1: Research papers that are more linguistically influential within an initial time window tend to receive more citations in the long term. The x-axis shows lexical and semantic influence, binned into quantiles (see § 2); the y-axis shows the corresponding regression coefficients and standard errors, in units of Z-normalized log future citations (see § 5.3). To give a sense of scale, for papers published in 2012, being in the top decile of semantic influence corresponds to a 14.5% increase in long-term citations, as compared to control-matched papers that received the same number of short-term citations and covered similar topics but were in the bottom half by semantic influence.

For this reason, we turn to content analysis: to what extent can the text of a research paper be said to influence the trajectory of the research community? In this paper, we present a novel technique for estimating the influence of documents in a timestamped corpus. To demonstrate the validity of the resulting measure of linguistic influence, we show that it is predictive of future citations. Specifically, we find that: (1) papers that our metric judges as highly influential in the short term tend to receive more citations in the long term; (2) short-term linguistic influence improves the ability to predict long-term citations over strong baselines.
Our modeling approach focuses on semantic changes, and treats the temporal usage of semantic innovations as emissions from a parametric low-rank Hawkes process (Hawkes, 1971). The parameters of the Hawkes process correspond to the linguistic influence of each paper, aggregated over thousands of linguistic changes. The changes themselves are identified through analysis of contextual embeddings, with the goal of finding words whose meaning has shifted over time (Traugott and Dasher, 2001). Though there are several computational methods to detect semantic changes (e.g., Kim et al., 2014; Hamilton et al., 2016; Rosenfeld and Erk, 2018; Dubossarsky et al., 2019), including methods based on contextual embeddings (e.g., Kutuzov and Giulianelli, 2020), our proposed method focuses on detecting smooth, non-bursty semantic changes; we also go further than other methods by distinguishing old and contemporary usages of an identified semantic change.
We show through a multivariate regression that our estimates of the semantic influence of each paper are positively correlated with its long-term citations, even after controlling for the initial citations, the content of the paper in terms of topics, and the lexical influence of the paper (see Figure 1). Further, we formulate long-term citation prediction as an online prediction task, constructing test sets for successive years. Adding semantic influence as a feature once again improves the predictive performance of the model over baselines. In summary, our contributions are as follows:

• We empirically demonstrate a link between long-term citation count and short-term linguistic influence, using both regression analysis (§ 5.3) and an online prediction task (§ 5.4).
• We present a method to estimate semantic influence using a parametric Hawkes process ( § 2.1).To achieve this, we find semantic changes and convert the usage of each change into a cascade ( § 2.2).We also show that the method can be applied to quantify lexical influence.
• We present a method to identify monotonic semantic changes from timestamped text using contextual embeddings (see § 2.2.1).

Methodology
This section describes our method for estimating the linguistic influence of each document in a timestamped collection. Our work builds on the theory of point process models (Daley et al., 2003), in which the basic unit of data is a set of marked event timestamps. In our case, the events correspond to uses of an innovative word or usage; the mark corresponds to the document in which the word or usage appears. To estimate the linguistic influence of individual documents, we fit a parametric model in which per-document influence parameters explain the density of events in subsequent documents. We first describe the modeling framework in which these influence parameters are estimated (§ 2.1) and then describe how event cascades are constructed (§ 2.2) from semantic changes (§ 2.2.1) and lexical innovations (§ 2.2.2).

Estimating document influence from timestamped events
A marked cascade is a set of marked events $\{e_1, e_2, \ldots, e_N\}$, in which each event $e_i = (t_i, p_i)$ corresponds to a tuple of a timestamp $t_i$ and a mark $p_i$. Assume a set of marked cascades, indexed by $w \in \mathcal{W}$, with each mark belonging to a finite set that is shared across all cascades. In our application, each cascade corresponds to the appearances of an individual word or word sense, and each mark is the identity of the document in which the word or word sense appears.
Point process models define probability distributions over cascades. In an inhomogeneous point process, the distribution of the count of events between any two timestamps $(t_1, t_2)$ is governed by the integral of an intensity function $\lambda(t, w)$. A Hawkes process is a special case in which the intensity function is the sum of terms associated with previous events (Hawkes, 1971). We choose the following special form,

$$\lambda(t, w) = c_w + \sum_{i : t_i^{(w)} < t} \alpha_{p_i^{(w)}}\, \kappa(t - t_i^{(w)}), \qquad (1)$$

where $\kappa$ is a time-decay kernel such as the exponential kernel $\kappa(\Delta t) = e^{-\gamma \Delta t}$ and $c_w$ is a constant. The parameter of interest is $\alpha$, which quantifies the influence exerted by the document $p_i^{(w)}$ on subsequent events. Our application focuses on research papers, which historically have been published in a few bursts, at conferences and in journals, rather than continuously over time. For this reason we simplify our setting further, discretizing the timestamps by year. The evidence to be explained is now of the form $n(t, w)$, the count of word or sense $w$ in year $t$. We model this count as a Poisson random variable, and estimate the parameters $c_w$ and $\alpha$ by maximum likelihood.
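As a concrete illustration, the discretized estimation can be sketched as follows: a single toy cascade with synthetic yearly counts, the intensity from Equation 1 with an exponential kernel, and a Poisson likelihood maximized with L-BFGS. The toy data and the simplification that each document contributes one first-use event are our own assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy cascade for one word w: yearly usage counts over 6 years, and the
# year in which each of 3 documents first used the word (all hypothetical).
years = np.arange(6)
counts = np.array([1, 2, 4, 3, 5, 4])        # n(t, w)
first_use = {0: 0, 1: 1, 2: 2}               # document id -> first-use year
gamma = 1.0                                  # bandwidth of the decay kernel

def kappa(dt):
    """Exponential time-decay kernel."""
    return np.exp(-gamma * dt)

def neg_log_lik(params):
    c_w, alpha = params[0], params[1:]       # base rate; per-document influence
    nll = 0.0
    for t in years:
        lam = c_w + sum(alpha[p] * kappa(t - t0)
                        for p, t0 in first_use.items() if t0 < t)
        # Poisson log-likelihood of the observed count, dropping log(n!)
        nll -= counts[t] * np.log(lam) - lam
    return nll

x0 = np.ones(4)                              # [c_w, alpha_0, alpha_1, alpha_2]
res = minimize(neg_log_lik, x0, method="L-BFGS-B",
               bounds=[(1e-6, None)] * 4)
c_hat, alpha_hat = res.x[0], res.x[1:]
```

In the full model, the estimates are pooled across thousands of cascades, so each document's influence parameter is informed by every innovation it helped spread.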

Building event cascades
To estimate the parameters in Equation 1, we require a set of timestamped events. Ideally these events should correspond to evidence of linguistic innovation. We consider two sources of events: semantic innovations (here focusing on words whose meaning changes over time) and lexical innovations (words whose usage rate increases dramatically over time).
We now introduce some notation used in the remainder of this section. Let a document be a sequence of discrete tokens from a finite vocabulary $\mathcal{V}$, so that document $i$ is denoted $X_i = [x_1^{(i)}, \ldots, x_{n_i}^{(i)}]$, with $n_i$ indicating the length of document $i$. A corpus is similarly defined as a set of $N$ documents, $\mathcal{X} = \{X_1, X_2, \ldots, X_N\}$, with each document associated with a discrete time $t_i \in \mathcal{T}$.

Using contextual embeddings to identify semantic changes
We use contextual embeddings to identify words whose meaning changes over time, following prior work on computational historical linguistics (e.g., Kutuzov and Giulianelli, 2020; see § 6 for a more comprehensive review). A contextual embedding $h_k^{(i)} \in \mathbb{R}^D$ is a vector representation of token $k$ in document $i$, computed from a model such as BERT (Devlin et al., 2019). When the distribution over $h$ for a given word changes over time, this is taken as evidence for a change in the word's meaning.
Let $m_{t^-,w}$ and $m_{t^+,w}$ be the counts of the word $w$ up to and after time $t$, respectively. The average representations of the word $w$ up to and after time $t$ are then calculated as

$$v_{t^-,w} = \frac{1}{m_{t^-,w}} \sum_{i : t_i < t} \sum_{k} \mathbb{1}[x_k^{(i)} = w]\, h_k^{(i)}, \qquad v_{t^+,w} = \frac{1}{m_{t^+,w}} \sum_{i : t_i \geq t} \sum_{k} \mathbb{1}[x_k^{(i)} = w]\, h_k^{(i)}. \qquad (2)$$
Further, the variance in the contextual embeddings of the word $w$ over the entire corpus is calculated componentwise, with $\mu_w$ equal to the mean contextualized embedding of the word $w$ and $m_w$ its total count:

$$s_{w,d} = \frac{1}{m_w} \sum_{i} \sum_{k} \mathbb{1}[x_k^{(i)} = w]\, \big(h_{k,d}^{(i)} - \mu_{w,d}\big)^2.$$
A semantic change score for a word $w$ at time $t$ is then the variance-weighted squared norm of the difference between its average pre-$t$ and post-$t$ contextualized embeddings (also known as the squared Mahalanobis distance),

$$r(w, t) = (v_{t^-,w} - v_{t^+,w})^\top S_w^{-1} (v_{t^-,w} - v_{t^+,w}), \qquad (3)$$

with $S_w = \mathrm{Diag}(s_w)$.
Correction for frequency effects. Both the mean and variance are estimated with larger samples for timestamps in the middle of $\mathcal{T}$ than at the initial and final timestamps. Consequently, the distance metric suffers from high sample variance for values of $t$ near these endpoints. The discrepancy is corrected by replacing the diagonal covariance $S_w$ in Equation 3 with an alternative covariance $\tilde{S}_w$ that reflects the additional uncertainty due to sample size. Specifically, we approximate the standard error of the mean $v_{t^-}$ componentwise as $s_w / \sqrt{m_{t^-,w}}$, and analogously for $v_{t^+}$. Then $\tilde{S}_w$ is defined as the product of these two approximate standard errors,

$$\tilde{S}_w = \mathrm{Diag}\!\left(\frac{s_w}{\sqrt{m_{t^-,w}\, m_{t^+,w}}}\right). \qquad (4)$$
Finally, $t^* = \arg\max_t r(w, t)$ is selected as the transition point for the change in meaning of $w$. The changes are identified by sorting the words by $\max_t r(w, t)$ and applying a set of basic filters explained in § 4. To give some intuition:

• If $w$ changes in meaning at time $t$, then the difference between its representations up to $t$ and after $t$ should be large. The metric in Equation 3 captures this precisely by calculating the term $v_{t^-,w} - v_{t^+,w}$.
• The difference in average embeddings can also be high for seasonal or bursty changes, seen in words such as turkey, which refers to the bird more frequently around American holidays (Shoemark et al., 2019). Rescaling the difference by the inverse variance encourages detection of monotonic changes.
• For rare words, the mean embeddings will be less reliable. The $\sqrt{m}$ terms in $\tilde{S}_w$ have the effect of emphasizing high-frequency words, for which changes in the mean embedding are likely to be significant.
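The scoring procedure above can be sketched as follows, using synthetic embeddings for a single word whose mean representation shifts at a known year. The toy data and the componentwise form of the sample-size correction are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic contextual embeddings for one word over 10 years (10 uses/year);
# the mean representation shifts at year 5.
emb = np.vstack([rng.normal(0.0, 1.0, (50, 8)),
                 rng.normal(3.0, 1.0, (50, 8))])
ts = np.repeat(np.arange(10), 10)            # year of each usage

mu = emb.mean(axis=0)
s = ((emb - mu) ** 2).mean(axis=0)           # per-dimension corpus variance s_w

def r(t):
    """Variance-weighted squared distance between pre-t and post-t means,
    with the diagonal covariance shrunk by the sample sizes
    (the frequency correction described above)."""
    pre, post = emb[ts < t], emb[ts >= t]
    diff = pre.mean(axis=0) - post.mean(axis=0)
    s_tilde = s / np.sqrt(len(pre) * len(post))
    return float(diff @ (diff / s_tilde))

scores = {t: r(t) for t in range(1, 10)}
t_star = max(scores, key=scores.get)         # estimated transition year
```

With this construction the score peaks at the planted transition year, because splits away from it mix old and new senses on one side and shrink the mean difference.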
Distinguishing old and new usages. The previous step yields semantic innovations and their transition times. Simply identifying semantic changes is insufficient, since at any given time a word could be used in either its old or its new sense with respect to its time of transition. To categorize every usage of a semantic innovation $w$, the contextual embeddings are passed through a logistic regression classifier that predicts whether the usage is before or after the transition time. At the end of this step, the sequence of embeddings for any semantic innovation is converted to a sequence of binary labels denoting their usage. For each word $w$, the cascade $(e_1^{(w)}, \ldots, e_{N_w}^{(w)})$ is formed by filtering the usages to those that are classified as corresponding to the newer sense, with each event $e_i^{(w)}$ containing a timestamp $t_i^{(w)}$ and a document identifier $p_i^{(w)}$. These cascades are the evidence from which we estimate the per-document semantic influence scores $\alpha^s$, as described in § 2.1.

Why contextual embeddings? Embeddings provide a powerful tool for understanding language change, offering more linguistic granularity than measures of change in the strength or composition of latent topics (e.g., Griffiths and Steyvers, 2004; Gerow et al., 2018). Prior work has employed diachronic non-contextual embeddings (e.g., Soni et al., 2021b). Such methods require each word to have a single shared embedding in each time period. During periods in which a word is used in multiple senses, the non-contextual embedding must average across these senses, making it harder to detect changes in progress.
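A minimal sketch of the usage classifier, assuming synthetic embeddings with a known old/new split; `cross_val_predict` stands in for the paper's 4-fold cross-validation so that every usage receives an out-of-fold label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
# Synthetic embeddings for one semantic innovation: 60 usages in the old
# sense followed by 60 in the new sense (separable by construction).
X = np.vstack([rng.normal(0.0, 1.0, (60, 8)),
               rng.normal(2.0, 1.0, (60, 8))])
y = np.array([0] * 60 + [1] * 60)            # 0 = before t*, 1 = after t*
years = np.repeat(np.arange(12), 10)         # year of each usage

# L2-regularized logistic regression; 4-fold cross-validation gives every
# usage an out-of-fold old/new label.
clf = LogisticRegression(penalty="l2", max_iter=1000)
labels = cross_val_predict(clf, X, y, cv=4)

# Timestamps of usages labeled as the *new* sense form the event cascade.
cascade = years[labels == 1]
```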

Identifying lexical changes
Unlike semantic changes, whose identification requires representations such as contextual embeddings, lexical changes are identified simply by comparing frequency changes. Specifically, for every word in the vocabulary we vary the segmentation year $t$ and calculate the word's relative frequency up to and after $t$. We then take the best relative frequency ratio across the years as the score of lexical change for that word, and aggregate the words into a list of changes by sorting on this score. In contrast to semantic changes, all usages of lexical changes are used to form cascades. These cascades are the evidence from which we estimate the per-document lexical influence scores $\alpha^\ell$, again using the methods in § 2.1.
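The frequency-ratio score can be sketched as follows; the toy counts and the add-one smoothing are our own illustrative choices, not details from the paper.

```python
# Hypothetical yearly counts for two words, and yearly corpus totals.
yearly_counts = {"bert":   [0, 0, 0, 1, 40, 80],
                 "parser": [30, 28, 31, 29, 30, 32]}
totals = [1000, 1000, 1000, 1100, 1700, 2120]

def lexical_change_score(word):
    """Best ratio of post-t to pre-t relative frequency over all split
    years t, with add-one smoothing for words unseen before t."""
    counts = yearly_counts[word]
    best = 0.0
    for t in range(1, len(counts)):
        pre = (sum(counts[:t]) + 1) / (sum(totals[:t]) + 1)
        post = (sum(counts[t:]) + 1) / (sum(totals[t:]) + 1)
        best = max(best, post / pre)
    return best
```

A word that bursts into use ("bert") scores orders of magnitude higher than a word with stable frequency ("parser").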

Overview
To summarize the method for computing semantic influence:

1. Compute the score $r(w, t)$ for each word $w$ and time $t$ as described in Equation 3 (with the adjusted covariance term from Equation 4), and threshold to identify semantic changes.
2. For each word selected in the previous step, classify each usage as either "old" or "new", and build a cascade from the timestamps of the new usages.
3. Aggregating over all the cascades, estimate the influence parameters $\alpha_i$ for each document in the collection.
A visual summary of the entire methodological pipeline is given in Figure 2.

Data
To construct a collection of research papers, we focus on papers that are included in the ACL Anthology. We collected the ACL Anthology bibliography file and converted the bib entries into JSON objects, retaining the title of each paper, the year in which it was published, and the venue.
We then stripped all whitespace and special characters from the title of the paper. These stripped titles and the year of publication were matched with papers in the S2ORC corpus (Lo et al., 2020). Matched papers that have a valid PDF parse (i.e., the full text of the paper) are retained. Though the S2ORC dataset contains papers from as far back as 1965, the coverage in the early years is sparse, with few or no papers in many of those years. As a result, the data is further filtered to retain only papers that appear from 1990 to 2019 ($\mathcal{T} = [1990, 2019]$).
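The matching step can be sketched as follows; the normalization function, the example record, and the identifier `paper-123` are all hypothetical.

```python
import re

def normalize_title(title):
    """Lowercase, then strip whitespace and special characters for matching."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

# One hypothetical anthology entry and a toy S2ORC lookup keyed on
# (normalized title, year).
anthology = [{"title": "BERT: Pre-training of Deep Bidirectional Transformers",
              "year": 2019, "venue": "NAACL"}]
s2orc = {("bertpretrainingofdeepbidirectionaltransformers", 2019): "paper-123"}

matched = []
for rec in anthology:
    key = (normalize_title(rec["title"]), rec["year"])
    if key in s2orc:   # keep only papers with an S2ORC match
        matched.append((rec["title"], s2orc[key]))
```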

Experimental Setup
For this study, multilingual BERT is used as the contextualizing model even though our data consists of English papers; this handles English-language papers that contain foreign-language tokens. Specifically, the bert-base-multilingual-uncased model from the Hugging Face (Wolf et al., 2020) library is used. The contextualized embeddings have 3072 dimensions, obtained by concatenating the final four layers.
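The layer concatenation can be sketched as follows; the hidden states here are random stand-ins for what a real encoder would return via `output_hidden_states=True`.

```python
import numpy as np

# Stand-in for encoder hidden states: (embedding layer + 12 layers, seq, dim).
rng = np.random.default_rng(2)
hidden_states = rng.normal(size=(13, 16, 768))

# Concatenating the final four layers yields 4 * 768 = 3072 dimensions
# per token.
token_embeddings = np.concatenate(list(hidden_states[-4:]), axis=-1)
```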

Continued pretraining. Previous work has shown that the quality of the contextual embeddings improves when pretrained BERT is further trained on domain-specific text (e.g., Gururangan et al., 2020). For this study, we continued to pretrain the BERT model for 3 epochs to optimize the masked language modeling objective. The masking probability is set to 15%.
Wordpiece aggregation. Since BERT learns subword embeddings by breaking tokens into wordpieces, the embeddings of the wordpieces need to be aggregated to obtain a representation of a token. This aggregation is done by taking the average of the wordpiece embeddings.

Data preprocessing. Non-English papers in the corpus are excluded from the analysis by identifying the language of each paper using langid (Lui and Baldwin, 2012). The vocabulary $\mathcal{V}$ is constructed by retaining words that appear at least 10 times in the abstracts and do not appear in more than 90% of abstracts. Each paper is first segmented by whitespace and then broken into chunks of 200 tokens. Only alphabetic tokens are retained.
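The aggregation amounts to a mean over wordpiece vectors; the example split of the token is hypothetical.

```python
import numpy as np

# Hypothetical embeddings for the wordpieces ["token", "##ization"], one
# row per wordpiece.
wordpiece_vecs = np.array([[1.0, 2.0, 3.0],
                           [3.0, 4.0, 5.0]])

# One vector for the whole token: the mean of its wordpiece embeddings.
token_vec = wordpiece_vecs.mean(axis=0)
```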

Classifying individual usages of semantic innovations
The off-the-shelf logistic regression classifier from scikit-learn is used to mark every individual instance of a semantic innovation as new or old. To avoid overfitting, we use $\ell_2$ regularization; all other inputs to the classifier are left at their defaults. 4-fold cross-validation is performed to obtain the final assignment of labels from the classifier.
Word filters. We keep words in our vocabulary if they are composed only of alphabetic characters, occur in at most 90% of the papers, and occur a minimum of 30 times in the entire corpus. We also eliminate words whose length is less than or equal to 2 characters.
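The filters can be sketched as a single predicate; the argument names are our own.

```python
def keep_word(word, doc_freq, n_docs, corpus_count):
    """Vocabulary filter: alphabetic, longer than 2 characters, in at most
    90% of papers, and at least 30 occurrences in the corpus."""
    return (word.isalpha()
            and len(word) > 2
            and doc_freq <= 0.9 * n_docs
            and corpus_count >= 30)
```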
Estimation. To estimate the parameters of the Hawkes process, we use scipy.optimize, which internally uses the L-BFGS solver.

Semantic changes
We identified 2910 semantic changes that capture several technical concepts in language research. The top changes and the periods in which their meanings shift are shown in Table 2.
The evolution of language research, from the earlier focus on syntax and sequence processing using latent variable models to the current paradigm of deep learning, is neatly summarized by the semantic innovations that the method identifies. Changes such as tokenization and transducers from the late nineties are indicative of the then-structural approach to core NLP research.
The earlier part of the 2000s saw changes in terms such as plan (see Table 5 for the contexts in which the term appears), whose narrow usage in messaging applications broadened to other applications. The next decade also saw changes in terms such as kernel and probabilistic. These indicate the methodological changes that were underway during this period, with NLP research being dominated by a mix of kernel and Bayesian methods (e.g., Moschitti, 2004; Blei et al., 2003). Methodological innovations such as conditional random fields (Lafferty et al., 2001) and the rise of domain adaptation (e.g., Chelba and Acero, 2004; Daumé III, 2007) are also evidenced by terms such as conditional and adaptation.
With the rise of neural approaches, words such as representations, network, and decoder underwent semantic changes between 2013 and 2017. Another prominent example of this shift is the term attention, shown in Figure 3, which shifts from its standard, broad usage to a more technical and focused usage with respect to neural networks around 2015.

Lexical changes
We selected the top 3000 lexical innovations to approximately match the number of semantic innovations. The lexical changes capture the introduction and rise in popularity of terms in language research. Unlike semantic changes, lexical changes are identified only by their change in frequency.
Among the top changes are terms such as bert, lstm, adam, and mturk, which exemplify new models, algorithms, tools, and technology introduced in language research. On the other hand, changes such as factuality (e.g., Saurí and Pustejovsky, 2012; de Marneffe et al., 2012; Soni et al., 2014) and sarcasm (Riloff et al., 2013; Ptáček et al., 2014) are examples of research topics that rose to prominence.

Regression analysis
Our objective is to test whether the linguistic influence of a paper is positively correlated with its rate of future citations. However, many factors can confound this analysis, including, but not limited to, the early citations a paper gets and the content of the paper. To control for these confounds and test our hypothesis, we frame the problem as a multivariate regression in which features that proxy linguistic influence are incorporated alongside proxy features of the other factors to predict future citations. For our analysis in this section and § 5.4, we consider papers published in or after the year 2000, since the density of innovations appearing in these years is higher. The total number of papers in this interval is 19153.
Our unit in the multivariate regression is a research paper, and the dependent variable is the Z-normalized logarithm of its future citations. The Z-normalization uses a separate mean and variance for each year of publication, which helps to account for secular trends in the overall rate of citation over time. By "future citations" we mean the difference between the number of citations a paper gets five years after its publication (hereon referred to as "long-term citations") and the number of citations a paper gets two years after its publication (hereon referred to as "short-term citations"). For example, for a paper published in 2012, the short-term citations are from the period 2012–2014 and the long-term citations are those accrued between 2015–2017.
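The construction of the dependent variable can be sketched as follows; using `log1p` (rather than a bare logarithm) to accommodate zero counts is our own assumption.

```python
import numpy as np

def z_norm_log_future(short_term, long_term, pub_years):
    """Future citations = long-term minus short-term counts; log-transform,
    then Z-normalize within each publication year to absorb secular trends."""
    future = np.log1p(np.asarray(long_term) - np.asarray(short_term))
    pub_years = np.asarray(pub_years)
    z = np.empty_like(future)
    for y in np.unique(pub_years):
        mask = pub_years == y
        z[mask] = (future[mask] - future[mask].mean()) / future[mask].std()
    return z
```

Within every publication year, the resulting variable has zero mean and unit variance, so coefficients are comparable across cohorts.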
To test the impact of semantic influence, we include three baseline regression models. In the first baseline, M1, we include the Z-normalized short-term citations and a constant term as our only covariates. M2, our second baseline, consists of all covariates in M1 plus the topic distribution of each paper, learned from an LDA model (Blei et al., 2003). The topic distribution is taken as a coarse representation of the content of the paper. Our final baseline is M3, which contains all the covariates from M2 in addition to categorical covariates corresponding to quantiles of the Z-normalized lexical influence, $\alpha^\ell$, of each paper. We consider four quantile bins: below the 50th percentile, from the 50th to below the 75th, from the 75th to below the 90th, and at or above the 90th. Finally, our experimental model, M4, has all the covariates from M3 and additional categorical covariates corresponding to the quantiles of the Z-normalized semantic influence, $\alpha^s$, of each paper. The quantiles are divided in the same way as for lexical influence.
The experimental model can be compared with the baseline models by their goodness of fit, measured by the log-likelihood of the data; the null hypothesis is that the goodness of fit of the experimental model is no better than that of the baseline models. Under the null, the likelihood ratio, our test statistic, follows a $\chi^2$ distribution with the excess number of parameters in the experimental model as the degrees of freedom. The null hypothesis can be rejected if the observed test statistic is determined to be unlikely under this distribution.
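The test can be sketched with scipy; plugging in the M3 and M4 log-likelihoods from Table 3 yields a statistic of 92 from the rounded values, close to the reported $\chi^2(3) = 91$.

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_baseline, ll_full, extra_params):
    """2 * (difference in log-likelihoods) ~ chi-squared, with df equal to
    the number of extra parameters in the experimental model."""
    stat = 2.0 * (ll_full - ll_baseline)
    return stat, chi2.sf(stat, df=extra_params)

# M3 vs. M4 log-likelihoods from Table 3; M4 adds 3 quantile covariates.
stat, p = likelihood_ratio_test(-18615, -18569, extra_params=3)
```

The vanishingly small p-value matches the paper's rejection of the null that M4 fits no better than M3.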
The regression coefficients are shown in Table 3. Not surprisingly, short-term citations are the strongest predictor of long-term citations, as seen from the strength of the regression coefficient. The regressions further reveal a strong relationship between semantic influence and long-term citations: M4 obtains a significantly improved fit over M3, our strongest baseline ($\chi^2(3) = 91$, $p \approx 0.0$). Without additional controls, the average rate of long-term citations for the top quantile of semantic influence is 3 times the long-term citation rate for the bottom quantile. With additional controls, the top quantile of semantic influence amounts to an increase in the expected citations by a factor of 1.2, in comparison to the papers in the bottom quantile.

Table 3: Regression analysis. We show the results of long-term citations for various ablations. Each column indicates a model, each row indicates a predictor, and each cell contains the coefficient and, in parentheses, its standard error. Log-likelihoods for M1–M4 are -18828, -18681, -18615, and -18569, respectively. Topics are included as controls in models M2–4, but for clarity their coefficients are reserved for the supplementary material. Results for the best bandwidth parameter ($\gamma = 100$), selected by the best heldout log-likelihood, are reproduced here, whereas the regression results for other bandwidth settings are in the supplementary material.

Predicting future citations
We now turn to predicting the long-term citations from semantic influence and the other predictors described in § 5.3. To more closely match the scenario of true future prediction, we formulate this as an online prediction task, in which the model is trained on past data to make predictions about future events (Karimi et al., 2015; Søgaard et al., 2021). Formally, to make predictions about papers published in year $t$, we use information from the interval $[t, t+2]$ to compute the predictors: short-term citations, lexical influence, and semantic influence. We then make predictions about citations in years $[t+3, t+5]$. To estimate the weights of these predictors, we assume access to training data up to year $t+2$. We then increment $t$ and make predictions about the papers published in the next year. In this way, all papers published in the period 2001–2014 appear in one of the test folds. The rest of the setup is similar to § 5.3, with one important difference: for the prediction task, we plug in estimates of lexical and semantic influence for all values of $\gamma \in \{0.001, 0.01, 0.1, 1.0, 10.0, 100.0\}$ as predictors in the model. The results of the online prediction of long-term citations are shown in Table 4. Performance is measured using the mean squared error (MSE) between the predicted and ground-truth values. The model M4, which includes our measure of semantic influence, achieves the lowest error in 13 of 14 years, and it gives a more accurate prediction than M3 for 57.8% of the 18554 papers in this slice of the dataset.
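The incremental temporal splits can be sketched as follows; the dictionary layout is our own.

```python
def temporal_folds(first_year=2001, last_year=2014):
    """For each test year t: predictors come from [t, t+2], targets are
    citations in [t+3, t+5], and training data runs through year t+2."""
    return [{"test_year": t,
             "predictor_window": (t, t + 2),
             "target_window": (t + 3, t + 5),
             "train_through": t + 2}
            for t in range(first_year, last_year + 1)]

folds = temporal_folds()
```

Each paper is evaluated exactly once, in the fold for its publication year, so no model ever sees information from its own target window at training time.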
Related Work

Linguistic change and influence
Several computational methods have been developed to identify changes in language (Eisenstein, 2019). Of particular interest are techniques for detecting semantic changes in a text corpus. Such techniques are based on a range of representations, including frequency statistics (e.g., Bybee, 2007), static, type-level word embeddings (e.g., Sagi et al., 2009; Wijaya and Yeniterzi, 2011; Kulkarni et al., 2015; Hamilton et al., 2016), and contextual word embeddings (e.g., Kutuzov and Giulianelli, 2020; Giulianelli et al., 2020; Montariol et al., 2021). Here, we use contextual embeddings, which are in principle advantageous over static embeddings because they can distinguish the dynamics of co-existing senses.
Although there are many methods to detect changes, only a few computational studies identify the leaders or followers of these changes, which is important in order to understand who carries influence. By modeling lexical changes as cascades on a network, researchers have inferred that such changes propagate because of influence from strong ties (e.g., Goel et al., 2016). Other researchers have identified leaders and followers of individual semantic changes and aggregated them to induce a leadership network between the sources (Soni et al., 2021a). Our work shares similarities with these prior studies but is distinct: we use similar cascade modeling techniques, but for semantic changes, whose cascades are considerably harder to construct.
Most relevant to our current work is that of Soni et al. (2021b), who find that semantically progressive scientific research papers get more citations. Semantic progressiveness, a measure of linguistic novelty, is calculated by comparing the old meaning of semantic innovations with their contemporary meaning in the context of the document. Our current work differs from this prior work in a key aspect: we estimate and establish a link between citation influence and semantic influence, rather than semantic novelty.

Citation influence
Citation count has historically been used as a proxy for the influence of a scientific article (Fortunato et al., 2018) and of researchers (Börner et al., 2004), and is shown to be strongly correlated with scientific prestige (Cole and Cole, 1968). Relevant to our work are studies that establish a link between citation influence and different measures of linguistic progressiveness. Kelly et al. (2018) find that progressiveness, measured in terms of the difference in textual similarity between old and new patents, is predictive of the future citations of a patent. Similarly, Soni et al. (2021b) find that progressiveness, measured as the early adoption of words with newer meanings, is predictive of the citations of a paper. In contrast, in this paper, we find a link between linguistic influence in the short term and the future citations of the paper.

Conclusion
We have presented a new technique for quantifying semantic influence in time-stamped documents. Quantitative analysis demonstrates that this measure of semantic influence is strongly correlated with the long-term citations a paper receives, and leads to improvements in the prediction of future citations. Our tool offers additional granularity in terms of linguistic influence, which can supplement structural measures of influence based on citation counts. Though we present quantitative analyses for scholarly documents in computational linguistics, our tool could be applied to scholarly documents in other research areas, or to documents such as patents and court opinions where citation counts are considered structural measures of influence. We plan to focus on these applications in the future.

Limitations
A simplifying assumption in this paper is that there exists one dominant sense of a change before and after the transition point. This assumption may not hold for every change in general, but it helps in developing computational methods to identify a large array of changes. In future work, we plan to extend the ability of our method to identify co-evolving senses.
A fundamental limitation of the Hawkes process is the closed-world assumption that all events are attributable to other observed events. This limitation is particularly relevant to our setting, where we observe only papers published in the ACL Anthology, but those papers influence and are influenced by a much wider discourse, which includes not only other academic research papers but also software artifacts, books, and social media. In practice, this means that our method might wrongly assign credit to "fast follower" papers that are the first to adopt ideas published outside the ACL universe. Similarly, we make no attempt to measure the extent to which ACL Anthology papers influence writing that is published elsewhere.
More generally, we cannot show whether the relationship between linguistic influence and citations is causal. The temporal asymmetry ensures that future citations are not themselves causes of linguistic influence, but we cannot exclude the possibility that there is a common cause for both phenomena. For example, it seems likely that factors such as the overall quality of the research and the fame of the authors contribute both to the extent to which a paper drives the adoption of linguistic features in the short term, and to the number of citations it receives in the long term. Our regression analysis includes control variables for some potential common causes, such as topics, but it is not possible to control for all potential confounders. Hence, our analysis should be considered correlational and not causal. Future work could focus on establishing and quantifying a causal link between linguistic influence and citations.

Ethics Statement
This paper offers a new tool for understanding scientific communication. Because this tool quantifies the linguistic impact of research papers, there is the possibility that it could be used for consequential decisions such as hiring, promotion, and funding. This implies a "leaderboard" approach to scholarship that would overvalue the most fashionable mainstream research topics, while penalizing research that has a deep impact in a relatively small community. Similar concerns have been raised about other measures of academic impact: Jorge Hirsch, the inventor of the h-index, noted that his metric could have "severe unintended negative consequences," and urged evaluators to go beyond any single index and consider the broader context when assessing an individual's scientific contributions (Conroy, 2020). The same applies to the semantic influence metric defined in this paper.

Figure 3: Semantic change in the term attention in the S2ORC ACL Anthology subset. The blue line indicates the transition year for the meaning change. The transition year for the term attention coincides with early papers that described the attention mechanism in neural networks (Bahdanau et al., 2015), which later became the bedrock of the transformer architecture (Vaswani et al., 2017).

A Examples of Semantic Changes
We show more statistical details for some of the semantic changes, and the contexts in which these changes occur, in Table 5. We also show an illustrative example of a change for the word attention, and how it transitions according to our metric, in Figure 3.

B Topic Coefficients
To control for the content of a paper, we use a coarse-grained representation of its content: we learn an LDA model and estimate the probability distribution of each research paper over topics. These probabilities are used as features in the regression and online prediction tasks. The regression coefficients of the topics in the full model, M4, are shown in Table 6.
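As a minimal sketch of this step, the snippet below fits an LDA model and uses the resulting per-document topic proportions as regression features. The corpus, topic count, and citation targets are illustrative placeholders, not the paper's actual data or hyperparameters.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Toy corpus standing in for paper abstracts (hypothetical).
docs = [
    "neural attention translation model",
    "dialogue plan agent system",
    "parsing grammar syntax tree",
    "neural network deep learning model",
]
log_citations = np.array([2.1, 0.7, 1.0, 2.5])  # hypothetical targets

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)           # per-document topic distributions
assert np.allclose(theta.sum(axis=1), 1.0)  # each row is a probability vector

# Topic proportions enter the regression as control features.
reg = LinearRegression().fit(theta, log_citations)
```

In the paper's full model the topic features sit alongside the influence and lexical predictors; here they are the only covariates, to keep the sketch self-contained.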

C Regression Results for Different Bandwidths
Different lexical and semantic influence estimates were learned by varying the bandwidth (γ). The bandwidth is a decay factor for the influence: a higher bandwidth corresponds to faster decay in influence, and a lower bandwidth to slower decay. The regressions were run for bandwidth values in {0.001, 0.01, 0.1, 1.0, 10.0, 100.0}, and the optimal bandwidth was selected based on goodness of fit on a 10% held-out sample. The regression results for all bandwidths are presented in Table 7, Table 8, Table 9, Table 10, and Table 11.

An event cascade $(e^{(w)}_1, \dots, e^{(w)}_{N_w})$ is formed by filtering the usages to those that are classified as corresponding to the newer sense, with each event $e^{(w)}_i$ containing a timestamp $t^{(w)}_i$ and a document identifier $p^{(w)}_i$.
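The role of the bandwidth can be illustrated with a toy exponential decay kernel, a common parameterization for Hawkes processes; the paper's exact kernel form is not shown here, so this parameterization is an assumption. The grid mirrors the values searched above.

```python
import math

# Assumed exponential kernel k(dt) = gamma * exp(-gamma * dt):
# larger gamma -> influence decays faster after an event.
def decay_kernel(dt, gamma):
    return gamma * math.exp(-gamma * dt)

bandwidths = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
dt = 2.0  # years since the triggering event
influence_at_dt = {gamma: decay_kernel(dt, gamma) for gamma in bandwidths}
```

Under this parameterization, a paper's influence two years out is far smaller with γ = 10 than with γ = 1, which is why the held-out fit is sensitive to the bandwidth choice.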

Figure 2: Methodological pipeline. The steps in our method can be summarized as follows for the example word attention. (A) depicts a collection of research papers that mention attention; (B) is a collection of contextual embeddings for attention across the entire corpus; (C) uses the contextual embeddings to find the transition point and the magnitude of the change; (D) uses the contextual embeddings to classify usages as old (marked with red crosses) or new (marked with green ticks) with respect to the transition time; (E) is a depiction of the event cascades, comprising (timestamp, paper_id ($p_i$)) pairs.
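Steps (C)–(E) of the pipeline can be sketched as follows: given contextual embeddings of a word with timestamps and paper ids, classify each usage as old or new sense by nearest centroid (centroids estimated from usages before and after the transition year), then keep (timestamp, paper_id) events for the new sense. This is a toy illustration; the paper's actual transition detection and usage classifier may differ.

```python
import numpy as np

def new_sense_cascade(embeddings, years, paper_ids, transition_year):
    """Return (timestamp, paper_id) events for usages nearest the new-sense centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    years = np.asarray(years)
    # Centroids of usages before / after the detected transition year.
    old_centroid = embeddings[years < transition_year].mean(axis=0)
    new_centroid = embeddings[years >= transition_year].mean(axis=0)
    cascade = []
    for vec, t, pid in zip(embeddings, years, paper_ids):
        # A usage belongs to the new sense if it is closer to the new centroid.
        if np.linalg.norm(vec - new_centroid) < np.linalg.norm(vec - old_centroid):
            cascade.append((t, pid))
    return cascade
```

Note that a pre-transition usage can still be classified as the new sense if its embedding lies nearer the new-sense centroid, which is the intended behavior of step (D).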

Figure 3: Visual depiction of change in a top example. Semantic change in the term attention in s2orc's ACL anthology subset. The blue line indicates the transition year for the meaning change. The transition year for the term attention coincides with early papers that described the attention mechanism in neural networks (Bahdanau et al., 2015), which later became the bedrock of the transformer architecture (Vaswani et al., 2017).

Table 1: Dataset summary. Descriptive summary of the curated ACL corpus from the s2orc dataset.

Table 2: Examples of semantic changes. We show a few handpicked examples amongst the top semantic changes in different periods. More context is shown in Table 5.

The listed terms indicate the rise in popularity of these concepts during specific years. Abbreviations such as sts and mt, and names of languages such as de and indonesian, are two categories of changes that prominently feature among top lexical changes. While the former indicates the necessity of naming technical concepts with memorable short forms, the latter is indicative of the rise of multilingual language research.

Table 4: Online predictive analysis. We show performance in terms of MSE for the ablated models on the online citation prediction task. The first column indicates the publication year, the subsequent columns are the various ablations as seen in Table 3, and each cell shows the MSE. The last row is the micro-averaged MSE over all examples. Note that smaller values indicate better predictive performance.
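As a small concrete sketch, micro-averaging pools squared errors over all examples across years before taking the mean, rather than averaging the per-year MSEs (which would weight small years equally with large ones). The function name and input shape here are illustrative, not from the paper.

```python
def micro_mse(preds_by_year):
    """Micro-averaged MSE over (predictions, gold_values) pairs grouped by year."""
    squared_errors = [
        (p - y) ** 2
        for preds, golds in preds_by_year
        for p, y in zip(preds, golds)
    ]
    # One mean over the pooled errors, so each example counts equally.
    return sum(squared_errors) / len(squared_errors)
```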

Table 5: Semantic change examples. Top examples of semantic changes identified from the curated ACL corpus from the s2orc dataset. The relative counts are counts per million tokens. Terms such as attention gain a new sense that is increasingly used later; terms such as plan show semantic widening, moving from a strong association with dialogue to other NLP tasks; terms such as network and deep show semantic narrowing, moving from diffuse associations to a narrower sense associated with neural networks.

Table 6: Topic coefficients and top words. We show the coefficients of the topics in the experimental model M4, along with the top words by probability in each topic.