Slangvolution: A Causal Analysis of Semantic Change and Frequency Dynamics in Slang

Languages are continuously undergoing changes, and the mechanisms that underlie these changes are still a matter of debate. In this work, we approach language evolution through the lens of causality in order to model not only how various distributional factors associate with language change, but how they causally affect it. In particular, we study slang, an informal register that is typically restricted to a specific group or social setting. We analyze the semantic change and frequency shift of slang words and compare them to those of standard, nonslang words. With causal discovery and causal inference techniques, we measure the effect that word type (slang/nonslang) has on both semantic change and frequency shift, as well as its relationship to frequency, polysemy, and part of speech. Our analysis provides new insights into the study of language change; e.g., we show that slang words undergo less semantic change but tend to have larger frequency shifts over time.


Introduction
Language is a continuously evolving system, constantly reshaped by its speakers. The forces that drive this evolution are many, ranging from phonetic convenience to sociocultural changes (Blank, 1999). In particular, the meanings of words and the frequencies with which they are used are not static, but rather evolve over time. Several previous works, in both historical and computational linguistics, have described diachronic mechanisms, often suggesting causal relationships. For example, semantic change, i.e., change in the meaning of a word, has been suggested both to cause (Wilkins, 1993; Hopper and Traugott, 2003) and to be caused by (Hamilton et al., 2016) polysemy, while part of speech (POS) has been implied to be a causal factor behind semantic change (Dubossarsky et al., 2016). However, none of these studies perform a causal analysis to verify these claims. Causality (Pearl, 2009) allows us not only to infer causal effects between pairs of variables, but also to model their interactions with other related factors.

Figure 1: We observe very different change dynamics for the slang word "duckface" and the nonslang word "inclusive." "Inclusive" has acquired a new meaning, reflected in a high semantic change score of 0.77 as measured by our model. "Duckface" undergoes little semantic change, scored 0.39 by our model, while its usage frequency varies greatly.
In this work, we focus on the linguistic evolution of slang, defined as colloquial and informal language commonly associated with particular groups (González, 1998;Bembe and Beukes, 2007), and use a causal framework to compare the change dynamics of slang words to those of standard language. More specifically, we compare the semantic change as well as the changes in frequency, i.e., frequency shift, over time between slang words and standard, nonslang words. We learn a causal graphical model (Spirtes et al., 2000) to assess how these variables interact with other factors they have been previously found to correlate with, such as frequency, polysemy and part of speech (Dubossarsky et al., 2016;Hamilton et al., 2016). Having discovered a graph, we proceed to use do-calculus (Pearl, 1995) to evaluate the causal effects of a word's type (slang/nonslang) on semantic change and frequency shift.
Semantic change is measured using the average pairwise distance (APD) (Sagi et al., 2009;Giulianelli et al., 2020) between time-separated contextualized representations, which were obtained from a Twitter corpus via a bi-directional language model (Liu et al., 2019). Our method builds on recent semantic change literature (Schlechtweg et al., 2020), with novel additions of dimensionality reduction and a combined distance function.
By deploying a causal analysis, we establish that there is not just an association, but a direct effect of a word's type on its semantic change and frequency shift. We find that a word being slang causes it to undergo slower semantic change and more rapid decreases in frequency. To illustrate, consider the slang word "duckface" and the nonslang word "inclusive" as shown in Figure 1. Duckface is a face pose commonly made for photos (Miller, 2011) in the early 2010s, and while it has largely decreased in frequency since, its meaning has not changed. In contrast, the nonslang word "inclusive" has developed a new usage in recent years (Merriam-Webster, 2019) and was given a high semantic change score by our model.
Our analysis also sheds light on a couple of previous findings in the diachronic linguistics literature. We find support for the S-curve theory (Kroch, 1989), showing a causal effect from a word's polysemy to its frequency. This relationship is evident in the increase in frequency that the word "inclusive" displays in Figure 1 after it develops a new meaning (Merriam-Webster, 2019). However, similar to Dubossarsky et al. (2017), we do not find causal links to semantic change from frequency, polysemy, or POS, which have been suggested in previous works (Hamilton et al., 2016;Dubossarsky et al., 2016).
In summary, our main contributions are threefold: (i) we formalize the analysis of change dynamics in language with a causal framework; (ii) we propose a semantic change metric that builds upon contextualized word representations; and (iii) we discover interesting insights about slang words and semantic change, e.g., showing that the change dynamics of slang words are different from those of nonslang words, with slang words exhibiting both more rapid frequency fluctuations and less semantic change.
2 Related Work

Semantic Change
A typical method for measuring semantic change is by comparing word representations across time periods (Gulordava and Baroni, 2011;Kim et al., 2014;Jatowt and Duh, 2014;Kulkarni et al., 2015;Eger and Mehler, 2016;Schlechtweg et al., 2019). With this approach, previous research has proposed laws relating semantic change to other linguistic properties (Dubossarsky et al., 2015;Xu and Kemp, 2015;Dubossarsky et al., 2016;Hamilton et al., 2016). For instance, Dubossarsky et al. (2016) find that verbs change faster than nouns, whereas Hamilton et al. (2016) discover that polysemous words change at a faster rate, while frequent words change slower. However, the validity of some of these results has been questioned via case-control matching (Dubossarsky et al., 2017), highlighting the influence of word frequency on the representations and thus on the semantic change metric (Hellrich and Hahn, 2016). Such analyses can indeed give stronger evidence for causal effects. In this work we take a methodologically different approach, considering observational data alone for our causal analysis.
The aforementioned works rely on fixed word representations, whereas more recent approaches (Hu et al., 2019; Giulianelli et al., 2020) have proposed semantic change measures based on contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019), which can flexibly capture contextual nuances in word meaning. This has led to a further stream of work on semantic change detection with contextualized embeddings (Martinc et al., 2020; Kutuzov and Giulianelli, 2020; Schlechtweg et al., 2020; Montariol et al., 2021; Kutuzov et al., 2021; Laicher et al., 2021). We build upon this line of work and extend it using principal component analysis (PCA) and a combination of distance metrics.

Characterization and Properties of Slang
Slang is an informal, unconventional part of the language, often used in connection to a certain setting or societal trend (Dumas and Lighter, 1978). It can reflect and establish a sense of belonging to a group (González, 1998;Bembe and Beukes, 2007;Carter, 2011) or to a generation (Citera et al., 2020;Earl, 1972;Barbieri, 2008). Mattiello (2005) highlights the role slang plays in enriching the language with neologisms, and claims that it follows unique word formation processes. Inspired by this, Kulkarni and Wang (2018) propose a data-driven model for emulating the generation process of slang words that Mattiello (2005) describes. Others have described the ephemerality of slang words (González, 1998;Carter, 2011), although this property has not been previously verified by computational approaches.

Causal Methodology for Change Dynamics
Examining change dynamics through a causal lens helps determine the existence of direct causal effects, by modeling the interactions between variables. For example, it allows us to conclude whether word type directly influences semantic change, or rather influences polysemy, which in turn causes semantic change. In this section, we first give a short overview of relevant work on causality, before presenting how we apply these concepts to word change dynamics.

Overview of Causal Discovery and Causal Inference
A common framework for causal reasoning is through causal directed acyclic graphs (DAGs) (Pearl, 2009). A causal DAG consists of a pair (G, P ) where G = (V, E) is a DAG and P is a probability distribution over a set of variables. Each variable is represented by a node v ∈ V , and the graph's edges e ∈ E reflect causal relationships. There are two main tasks in causality. Causal discovery is the task of uncovering the causal DAG that explains observed data. Assuming a causal DAG, the task of causal inference then concerns determining the effect that intervening on a variable, often referred to as treatment, will have on another variable, often referred to as outcome.
The causal DAG is often inferred from domain knowledge or intuition. However, in cases where we cannot safely assume a known causal structure, causal discovery methods become useful. Constraint-based methods (Spirtes et al., 2000) form one of the main categories of causal discovery techniques. These methods use conditional independence tests between variables in order to uncover the causal structure. To do so, they rely on two main assumptions: that the graph fulfills the global Markov property and the faithfulness assumption. Together these state that we observe conditional independence between two variables in the distribution if and only if the two variables are d-separated (Geiger et al., 1990) in the graphical model. For more details, we refer to Appendix D.1.
Causal inference is commonly approached with do-calculus (Pearl, 1995). We define the intervention distribution P(Y | do(X = x)) to be the distribution of the outcome Y under an intervention do(X = x) that forces the treatment variable X to take on the value x. Note that this is in general not equal to P(Y | X = x). When they are not equal, we say that there is confounding. Confounding occurs when there is a third variable Z that causes both the treatment X and the outcome Y.
We say that there is a causal effect of X on Y if there exist values x and x′ such that

P(Y | do(X = x)) ≠ P(Y | do(X = x′)). (1)

One way to quantify the causal effect is with the average causal effect (ACE):

ACE = E[Y | do(X = x)] − E[Y | do(X = x′)]. (2)

To estimate the causal effect using observational data, we need to rewrite the intervention distribution using only conditional distributions. Assuming a causal DAG, this can be done with the truncated factorization formula (Pearl, 2009), which for a suitable adjustment set W ⊂ V yields

P(y | do(X = x)) = Σ_{x_W} P(y | x, x_W) P(x_W), (3)

with X_W being the variables in P corresponding to the nodes in W.
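To make the gap between P(Y | X) and P(Y | do(X)) concrete, the following sketch (our own illustration, not taken from the paper) simulates a binary confounder Z that causes both a treatment X and an outcome Y, and recovers the true interventional contrast via the adjustment formula of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Confounder Z causes both the treatment X and the outcome Y.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))    # Z -> X
y = 2.0 * z + 0.5 * x + rng.normal(0, 0.1, n)      # Z -> Y, X -> Y (true effect 0.5)

# Naive conditional contrast E[Y|X=1] - E[Y|X=0] is inflated by the confounder.
naive = y[x == 1].mean() - y[x == 0].mean()

# Adjustment formula: average the Z-stratified contrasts, weighted by P(Z=z).
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)

print(round(naive, 2), round(adjusted, 2))  # naive ~ 1.7 (biased); adjusted ~ 0.5
```

The naive contrast attributes the confounder's influence to the treatment, while adjusting on Z recovers the causal effect.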

Causality for Change Dynamics
In this work, we estimate the direct causal effect of a word's type on its semantic change and frequency shift dynamics. In order to establish that such an effect exists, and to know which variables to control for, we turn to causal discovery algorithms. The variables in our causal graph additionally include frequency, polysemy and POS. For learning the causal graph, we choose the constraint-based PC-stable algorithm (Colombo and Maathuis, 2014), an order-independent variant of the well-known PC algorithm (Spirtes et al., 2000), discussed in Appendix D.1. We learn a mixed graphical model (Lauritzen, 1996; Lee and Hastie, 2015), consisting of both continuous (e.g., frequency) and categorical (e.g., type) variables. For this reason we opt for constraint-based algorithms, which allow us to tailor the conditional independence tests to the various data types.
Having learned the causal graph (Section 6.2), we proceed to estimate the ACE of word type on both semantic change and frequency shift using do-calculus (Section 6.3).

Slang and Nonslang Word Selection
We select 100 slang words and 100 nonslang words for our study, presented in Appendix E. In the tradeoff between statistical significance and the time spent on computation and data collection, we found that a set of 200 words was enough to obtain highly significant results. The slang words are randomly sampled from the Online Slang Dictionary, which provides well-maintained and curated slang word definitions as well as a list of 4,828 featured slang words as of June 2021. We limit the scope of our study to single-word expressions, filtering out 2,169 multi-word expressions. To further clean the data, we also delete words with only one character, as well as acronyms. Lastly, we limit the causal analysis to words that are exclusively either slang or nonslang, excluding "hybrid" words with both slang and nonslang meanings, such as "kosher" or "tool." Including words of this type would have interfered with the causal analysis by creating a hardcoded dependency between word type and polysemy, as these words are by definition polysemous. We do, however, perform a separate analysis of the hybrid words in Appendix C.
For the reference set of standard, nonslang words, we sample 100 words uniformly at random from a list of all English words, supplied by the wordfreq library in Python (Speer et al., 2018).

Data Collection
We curate a Twitter dataset from the years 2010 and 2020, which we select as our periods of reference, and collect the following variables:
• Word type: Whether a word is slang or not
• Word frequency: The average number of tweets containing the word per day in 2010 and 2020 (Section 5.2)
• Frequency shift: The relative difference in frequency the word has undergone between 2010 and 2020 (Section 5.3)
• Polysemy: The number of senses a word has (Section 5.4)
• Part of speech: A binary variable for each POS tag (Section 5.5)
• Semantic change: The semantic change score of the word from 2010 to 2020 (Section 5.6)

Twitter Dataset
As a social media platform, Twitter is rich in both slang and nonslang language. The Twitter dataset we curated comprises 170,135 tweets from 2010 and 2020 that contain our selected words. Sampling tweets from two separate time periods allows us to examine the semantic change over a 10-year gap. For every slang and nonslang word, and each of the two time periods, we obtain 200-500 random tweets that contain the word and were posted during the corresponding year. We keep each tweet's text, tweet ID, and the date it was posted. As a post-processing step, we remove all URLs and hashtags from the tweets. To protect user privacy, we further replace all user name handles with the word "user." On average, we have 346 tweets per slang word and 293 tweets per nonslang word.
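The post-processing steps above can be sketched with a few regular expressions (a minimal illustration; the exact patterns used in our pipeline may differ):

```python
import re

def preprocess_tweet(text: str) -> str:
    """Remove URLs and hashtags, and replace @-handles with 'user'."""
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"#\w+", "", text)           # strip hashtags
    text = re.sub(r"@\w+", "user", text)       # anonymize handles
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(preprocess_tweet("@bob lol check https://t.co/abc #duckface selfie"))
# -> "user lol check selfie"
```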

Word Frequency
We approximate a word's frequency by the average number of times it is tweeted within 24 hours. In practice, this average is calculated over 40 randomly sampled 24-hour time frames in a given year, in each of which we retrieve the number of tweets containing the word. The frequencies are calculated separately for 2010 and 2020. Due to the growing popularity of social media, the number of tweets has significantly increased over the decade. Therefore, we divide the counts from 2020 by a factor of 6.4, which is the ratio between the average word counts in the two years in our dataset. The frequencies from both years are then averaged to provide the frequency variable for the causal analysis.

Figure 2: Relative shift in frequency from 2010 to 2020, where a positive score corresponds to an increase in frequency. We see that slang words present both the highest increases and the highest decreases in frequency. Moreover, a large frequency decrease is observed exclusively in a set of slang words, indicating these words faded from usage during the decade.

Frequency Shift
We are now interested in analyzing the dynamics of frequency shifts. To evaluate the relative change in frequency for a given word w we take

Δ(w) = log(x_2020(w) / x_2010(w)), (4)

where x_k(w) is the frequency of word w in year k. The log ratio has been shown to be the only metric for relative change that is symmetric, additive, and normed (Tornqvist et al., 1985). Importantly, this measure symmetrically reflects both increases and decreases in relative frequency. The mean relative changes in frequency were −0.486 (±1.644) for slang words and 0.533 (±1.070) for nonslang words, where a positive score corresponds to an increase in frequency. As evident in Figure 2, not only did more slang words than nonslang words exhibit a decrease in frequency, but the words that showed the highest frequency increases are also slang. We also examine the absolute value of Eq. (4) to evaluate the degree of change, be it a decrease or an increase. We find that, as expected, slang words have significantly larger changes in frequency than nonslang words (p < 0.05). See Appendix C for more details.
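A minimal sketch of this relative-change measure, the log frequency ratio (function and variable names are our own):

```python
import math

def frequency_shift(freq_2010: float, freq_2020: float) -> float:
    """Log ratio of frequencies: symmetric (swapping years flips the sign)
    and additive across consecutive periods."""
    return math.log(freq_2020 / freq_2010)

# A word whose frequency doubled vs. one that halved get opposite scores.
up = frequency_shift(100, 200)
down = frequency_shift(200, 100)
print(round(up, 3), round(down, 3))  # 0.693 -0.693
```

Symmetry is what makes increases and decreases directly comparable; a ratio-based measure like (x2 − x1)/x1 would not have this property.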

Polysemy
We define a word's polysemy score as the number of distinct senses it has. For nonslang words, we take the number of senses the word has in Merriam-Webster, and for slang words we take the number of definitions on the Online Slang Dictionary. We use two separate resources because we find that no single dictionary encapsulates both slang and nonslang words. The mean polysemy scores are 2.074 (±2.595) for slang words and 3.079 (±2.780) for nonslang words, with a significant difference in distribution (p < 0.05) according to a permutation test, implying that the latter are used with a larger variety of meanings. In addition, the slang senses of the hybrid words exhibit a distribution similar to those of the slang words (Appendix C). More polysemous words tend to have a higher word frequency in our dataset: the log transform of frequency and polysemy display a highly significant (p < 0.001) linear correlation coefficient of 0.350.

Part of Speech
For each word, we retrieve four binary variables indicating whether the word can be used as a noun, verb, adverb, or adjective, which were the four major POS tags observed in our data. To compute these variables, we run the NLTK POS tagger (Loper and Bird, 2002) on the tweets and collect the distribution of POS tags for each word. Note that a word may have more than one POS tag, depending on the context in which it is used. Each binary variable is then set to 1 if the word carries the corresponding POS tag in at least 5% of its tweets, and 0 otherwise.
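As an illustration, the 5% thresholding over a word's POS-tag distribution might look as follows (the tag counts here are invented for the example):

```python
from collections import Counter

MAJOR_POS = ("noun", "verb", "adverb", "adjective")

def pos_indicators(tag_counts: Counter, threshold: float = 0.05) -> dict:
    """1 if the word carries the tag in at least `threshold` of its tweets."""
    total = sum(tag_counts.values())
    return {pos: int(tag_counts.get(pos, 0) / total >= threshold)
            for pos in MAJOR_POS}

counts = Counter({"noun": 180, "verb": 15, "adjective": 5})  # 200 tagged tweets
print(pos_indicators(counts))
# noun (90%) and verb (7.5%) clear the 5% bar; adjective (2.5%) and adverb do not
```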

Semantic Change Score
In this section we explain how we obtain the semantic change scores. We start by fine-tuning a bi-directional language model on a slang-dense corpus (Section 5.6.1). We then survey the literature and propose metrics (Section 5.6.2), which we evaluate in an extensive experimental study to find the most suitable one (Section 5.6.3). Finally, we apply this metric to our sets of slang and nonslang words on the Twitter data (Section 5.6.4).

Obtaining Contextualized Representations
We familiarize the bi-directional language model with slang words and the contexts in which they are used by fine-tuning it on the masked language modeling task. For this purpose we use a web-scraped dataset from Urban Dictionary, previously collected by Wilson et al. (2020). After preprocessing and subsampling, the details of which can be found in Appendix A.1, we are left with a training set of 200,000 slang-dense text sequences. As our bi-directional language model we select RoBERTa (Liu et al., 2019). Beyond its performance gains compared to the original BERT (Devlin et al., 2019), we select this model because it allows for more subword units. We reason that this could be useful in the context of slang, since some of the sub-units used in slang words would potentially not have been recognized by BERT. We choose the smaller 125M-parameter base version for computational reasons. We train the model using the Adam optimizer (Kingma and Ba, 2015) with different learning rates γ. The lowest loss on the test set was found with γ = 10^−6, which we proceed with for scoring semantic change. For more details on training configurations, we refer to Appendix A.2.

Quantifying Semantic Change
In order to select a change detection metric, we evaluate our model on the SemEval-2020 Task 1 on Unsupervised Lexical Semantic Change Detection (Schlechtweg et al., 2020). This task provides the first standard evaluation framework for semantic change detection, using a large-scale labeled dataset for four different languages. We restrict ourselves to English and focus on subtask 2, which concerns ranking a set of 37 target words according to their semantic change between two time periods. The ranking is evaluated using Spearman's rank-order correlation coefficient ρ. Our space of configurations includes layer representations, dimensionality reduction techniques, and semantic change metrics.
Layer Representations: Previous work (Ethayarajh, 2019) has shown that embeddings retrieved from bi-directional language models are not isotropic, but are rather concentrated around a high-dimensional cone. Moreover, the level of isotropy may vary according to the layer from which the representations are retrieved (Ethayarajh, 2019; Cai et al., 2021). This leads us to experiment with representations from different layers of our fine-tuned RoBERTa model, namely taking only the first layer, only the last layer, or summing all layers.
Dimensionality Reduction: To the best of our knowledge, only one previous semantic change detection approach (Rother et al., 2020) has incorporated dimensionality reduction, more specifically UMAP (McInnes et al., 2018). As Euclidean distances in the UMAP-reduced space are highly sensitive to hyperparameters, and UMAP does not retain an interpretable notion of absolute distance, it may be unsuitable for purely distance-based metrics like APD; we therefore also experiment with PCA.
Metrics for Semantic Change: Given representations X_t = {x_{1,t}, ..., x_{n_t,t}} for a particular word in time period t, we define the average pairwise distance (APD) between two periods t1 and t2 as

APD(X_t1, X_t2) = (1 / (n_t1 · n_t2)) Σ_{x_i ∈ X_t1} Σ_{x_j ∈ X_t2} d(x_i, x_j), (5)

for some distance metric d(·, ·), where n_t1, n_t2 are the numbers of representations in each time period. We experiment with the Euclidean distance d_2(x_1, x_2), the cosine distance d_cos(x_1, x_2), and the Manhattan distance d_1(x_1, x_2). Furthermore, we propose a novel combined metric. Note that d_2(·, ·) ∈ [0, ∞) while d_cos(·, ·) ∈ [0, 2]. Normalizing both metrics to a common support of [0, 1], we take the combined metric to be their average. We argue that this provides a more complete metric, capturing both the absolute distance and the angle between vectors. In addition to the APD metrics, we experiment with distribution-based metrics (see Appendix B.1).

Table 1: Spearman's rank-order correlation coefficients between our semantic change scores and the ground truth across different dimensionality reduction techniques for APD (*: p < 0.05, **: p < 0.01).
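To make the APD computation concrete, here is a minimal NumPy sketch (our own illustration; in particular, squashing the Euclidean distance into [0, 1] via d/(1 + d) is an assumption on our part, as one of several plausible normalizations):

```python
import numpy as np

def pairwise_euclidean(X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
    return np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)

def pairwise_cosine(X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
    X1n = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
    X2n = X2 / np.linalg.norm(X2, axis=1, keepdims=True)
    return 1.0 - X1n @ X2n.T

def apd_combined(X1: np.ndarray, X2: np.ndarray) -> float:
    """APD under the combined metric: average of cosine and Euclidean
    distances, each mapped to [0, 1]. The Euclidean squashing d/(1+d)
    is an illustrative choice, not the paper's exact normalization."""
    d_cos = pairwise_cosine(X1, X2) / 2.0          # [0, 2] -> [0, 1]
    d_euc = pairwise_euclidean(X1, X2)
    d_euc = d_euc / (1.0 + d_euc)                  # [0, inf) -> [0, 1)
    return float(((d_cos + d_euc) / 2.0).mean())

rng = np.random.default_rng(0)
X_2010 = rng.normal(0.0, 1.0, size=(50, 10))   # stand-ins for (reduced) embeddings
X_2020 = rng.normal(0.5, 1.0, size=(50, 10))   # shifted cloud, i.e., changed usage
print(apd_combined(X_2010, X_2020))            # a score in [0, 1]
```

In the full pipeline, the two sets of contextualized representations would first be projected down with PCA before this distance is computed.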

Evaluating the Semantic Change Scores
We first compare the results for the three types of layer representations across the APD metrics, and note that summing all layer representations yields the best results. Consequently, we proceed with the remaining experiments using only these representations. For both PCA and UMAP, we experiment with projecting the representations down to h ∈ {2, 5, 10, 20, 50, 100} dimensions. These combinations are tested together with the APD metrics presented in Section 5.6.2 as well as the distribution-based metrics described in Appendix B. The latter, however, do not in general display significant correlations.
We present a small subset of the scores resulting from the APD configurations in Table 1, highlighting our finding that both PCA dimensionality reduction and the combined metric improve performance. More results and comparisons to baselines are presented in Appendix B.3. We observe that the proposed combined metric consistently outperforms both d_2 and d_cos across values of h for PCA. We also note that UMAP projections perform poorly with the APD metrics, and that projecting down to 50-100 dimensions seems to be optimal, maintaining 70-85% of the variance as we illustrate in Appendix B.2. In addition, both norm-based metrics d_1 and d_2 perform worse with dimensionality reduction. As our final metric, we choose the best-performing configuration on SemEval: PCA with h = 100 and the combined metric, as seen in Table 1.

Semantic Change Scores for Slang and Nonslang Words on the Twitter Dataset
We obtain semantic change scores using the Twitter dataset described in Section 5.1. For the semantic change analysis, we exclude words that have fewer than 150 tweets in each time period within the dataset, which leaves us with 80 slang and 81 nonslang words. We also normalize the scores according to the sample. The resulting semantic change scores are shown in Figure 3. The mean semantic change scores are 0.564 (±0.114) for slang words and 0.648 (±0.084) for nonslang words. The difference in semantic change score distributions is significant (p < 0.001) via a permutation test. The word with the highest semantic change score of 1 is "anticlockwise," and the word with the lowest score of 0 is "whadja."

6 Causal Analysis

Preparation for Causal Discovery
PC-stable is constraint-based and thus makes use of conditional independence tests. In the case of continuous Gaussian variables, we can perform partial correlation tests to assess conditional independence, since zero partial correlation is in this case equivalent to conditional independence (Baba et al., 2004). As word frequency has been suggested to follow a lognormal distribution (Baayen, 1992), we take its log transform. The continuous variables semantic change, frequency shift and log-frequency are then all assumed to be approximated well by a Gaussian distribution, which is confirmed by diagnostic density and Q-Q plots (displayed in Appendix D.2). We categorize the discrete polysemy variable, experimenting with nine different plausible categorizations for the sake of robustness of the results. Word type and POS are categorical in nature. For the categorical variables, and for mixes of categorical and continuous variables, we perform chi-squared tests based on mutual information (Edwards, 2000), since the approximate null distribution of the mutual information is chi-squared (Brillinger, 2004). For all conditional independence tests we experiment with significance levels α ∈ {0.01, 0.03, 0.05}.

Figure 4: DAG representing the causal relationships in our dataset. We see that word type directly influences frequency shift, semantic change and polysemy, and polysemy in turn influences frequency.
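A partial-correlation test of the kind PC-stable relies on can be sketched in a few lines (our own minimal version, using the standard residual-based formulation): regress both variables on the conditioning set and correlate the residuals. A near-zero partial correlation then supports conditional independence.

```python
import numpy as np

def partial_corr(x: np.ndarray, y: np.ndarray, Z: np.ndarray) -> float:
    """Correlation of x and y after linearly regressing out the columns of Z."""
    Z = np.column_stack([np.ones(len(x)), Z])            # include an intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]    # residuals of x on Z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # residuals of y on Z
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)                 # common cause
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)             # x and y independent given z

# Marginally correlated (~0.5), but partial correlation given z vanishes.
print(round(np.corrcoef(x, y)[0, 1], 2), round(partial_corr(x, y, z[:, None]), 2))
```

In a constraint-based search, the vanishing partial correlation would lead the algorithm to remove the edge between x and y while keeping their edges to z.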

Resulting Causal Structure
In Figure 4 we see the result from the above approach, with dashed lines representing edges that were apparent in most but not all of the configurations. See Appendix D.3 for a sensitivity analysis.
We first observe that word type has a direct causal effect on both the semantic change score and the frequency shift, without any confounding from the other variables. We also note a direct influence of word polysemy on frequency.
Moreover, none of the four POS categories, which are all gathered in one node in Figure 4, have a causal link to any of the other variables. We additionally observe a dependency between word type and polysemy. This edge could not be oriented by the PC-stable algorithm; we therefore manually orient it as outgoing from type and ingoing to polysemy, since an intervention on type should have a causal effect on the number of word senses, and not vice versa. It is also interesting to note that polysemy does not seem to have a causal effect on semantic change. Its association with semantic change (p < 0.05, rejecting the null hypothesis of independence between polysemy and semantic change) is instead confounded by word type.

Causal Effects
In our case of no confounders, evaluating the ACE of word type on semantic change is straightforward, as it reduces to the difference between the conditional expectations:

ACE = E[Y | type = nonslang] − E[Y | type = slang], (6)

where Y is the outcome of interest. See Appendix D.4 for a derivation. The case of frequency shift is analogous. We estimate the expectations by the sample means on the normalized values and get an average causal effect of 0.084 on semantic change, which is highly significant (p < 0.001) based on a t-test. For the observed changes in relative frequency, calculated according to Eq. (4), we get an average causal effect of 1.017 (p < 0.001 via a t-test).
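With no confounders in the learned graph, the ACE estimate is simply a difference of group means, tested with a t-test; a sketch with deterministic stand-in scores (the real per-word scores are not reproduced here, only the reported group means):

```python
import numpy as np
from scipy import stats

# Stand-in score arrays with the reported group means (0.564 for slang,
# 0.648 for nonslang); the spread is illustrative, not the real data.
slang = 0.564 + 0.114 * np.linspace(-1.5, 1.5, 80)
nonslang = 0.648 + 0.084 * np.linspace(-1.5, 1.5, 81)

# Without confounders, the ACE reduces to the difference of conditional means.
ace = nonslang.mean() - slang.mean()
t, p = stats.ttest_ind(nonslang, slang, equal_var=False)  # Welch's t-test
print(round(ace, 3), p < 0.001)  # -> 0.084 True
```

The same computation, applied to the relative frequency changes of Eq. (4), yields the frequency-shift effect.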

Discussion
We analyze the dynamics of frequency shift and semantic change in slang words, and compare them to those of nonslang words. Our analysis shows that slang words change more slowly in meaning but are subject to more rapid frequency fluctuations, and are more likely to decrease greatly in frequency. Our study is the first computational approach to confirm this ephemerality of slang words (González, 1998; Carter, 2011).
To ensure that this is the result of a causal effect, and not mediated through another variable or subject to confounders, we model the data with a causal DAG, by also considering the potential interacting variables polysemy, frequency and POS. We discover that there is no influence of confounders, nor are there mediators between a word's type and its semantic change or its frequency shift, which confirms a direct causal effect. This means that if we could intervene on a word's type, i.e., by setting it to be slang instead of nonslang or vice versa, we would expect its change dynamics to differ.
Our results are consistent with those of Dubossarsky et al. (2017), who found that the laws relating semantic change to frequency and polysemy (Hamilton et al., 2016), as well as to prototypicality (Dubossarsky et al., 2015), were not as strong as previously thought, after a case-control study using a scenario without semantic change. Indeed, there is no directed path from polysemy or frequency to semantic change in our causal graph, but both are influenced by word type. We leave it for future research to explore whether other word categorizations, e.g., related to specific domains, languages or phonetic aspects, sustain this result.
In addition, our analysis does not support the claim that POS could underlie semantic change (Dubossarsky et al., 2016). We note however that as our vocabulary contains 50% slang words, the results need not be consistent with results obtained with a word sample drawn from standard language.
Moreover, in the causal structure we discover that word polysemy has a direct effect on word frequency, which is in line with previous linguistic studies showing that a word's frequency grows in an S-shaped curve when it acquires new meanings (Kroch, 1989;Feltgen et al., 2017), as well as a known positive correlation between polysemy and frequency (Lee, 1990;Casas et al., 2019). We emphasize that this relationship is not merely an artifact of contextualized word representations being affected by frequency (Zhou et al., 2021), since our polysemy score does not rely on word representations as in Hamilton et al. (2016). Our approach is however not without drawbacks -the polysemy variable is collected from dictionaries, which may be subjective in their assignments of word senses.
Our study, along with previous work on the dynamics of semantic change, is limited by mainly considering distributional factors. Linguists have suggested that sociocultural, psychological and political factors may drive word change dynamics (Blank, 1999;Bochkarev et al., 2014), and slang words are not an exception. Although challenging to measure, the influence of such factors on slang compared to nonslang words would be interesting to examine in future work.
In conclusion, we believe that a causal analysis as we have presented here provides a useful tool to understand the underlying mechanisms of language. Complementing the recent emergence of research combining causal inference and NLP (Feder et al., 2021), we have shown that tools from causality can also be beneficial for gaining new insights in diachronic linguistics.

Conclusion
In this work, we have analyzed the diachronic mechanisms of slang language with a causal methodology. This allowed us to establish that a word's type has a direct effect on its semantic change and frequency shift, without mediating effects from other distributional factors.

Acknowledgments
We would like to thank Steven R. Wilson for providing us with the Urban Dictionary data and Walter Rader for providing us with a curated set of slang words from the Online Slang Dictionary. For the Twitter data, we are thankful to have been able to get access to Twitter's Academic Research Track. Finally, we gratefully acknowledge feedback and helpful comments from Mario Giulianelli, Yifan Hou, Bernhard Schölkopf and three anonymous reviewers.
This material is based in part upon works supported by the John Templeton Foundation (grant #61156); by a Responsible AI grant by the Haslerstiftung; by an ETH Grant (ETH-19 21-1); by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B; and by the Machine Learning Cluster of Excellence, EXC number 2064/1 -Project number 390727645.

Ethical Considerations
Our dataset is composed solely of English text. This means that our analysis applies uniquely to the English language, and results may differ in other languages. Moreover, for the purpose of this study, we curated a dataset of 170,135 tweets. We emphasize that in order to protect the anonymity of users, we remove all author IDs from the data and replace all usernames with the general token "user." In the Urban Dictionary dataset we received from Wilson et al. (2020), each entry includes, among other fields, the username of the submitter and a timestamp. As the data is crowd-sourced, many of these entries are noisy and of low quality. We therefore filter the lower-quality definitions out before fine-tuning RoBERTa. After performing data exploration, we identified the two criteria that we found most indicative of a definition's quality: the number of upvotes it received, and its upvote/downvote ratio. The distributions of upvotes, downvotes and upvote/downvote ratios in the dataset can be seen in Figure 6 below. We also note that the number of submissions to Urban Dictionary is relatively well spread over time, see Figure 5. This implies that the dataset is not strongly biased towards more recently popularized slang terms, and that the entire time span of interest, 2010-2020, is represented.
We keep the entries having more than 20 upvotes and an upvote/downvote ratio of at least 2. This leaves us with 488,010 Urban Dictionary entries, from which we randomly sample 100,000 to reduce the computation time of the fine-tuning process. We use both the definitions and the word usage examples for fine-tuning, producing a final dataset of 200,000 sequences.
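The quality filter above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the entry field names ("ups", "downs") and the example entries are hypothetical.

```python
# Sketch of the Urban Dictionary quality filter: keep entries with more
# than 20 upvotes and an upvote/downvote ratio of at least 2.
# Field names "ups"/"downs" are assumed for illustration.

def keep_entry(entry, min_ups=20, min_ratio=2.0):
    """Return True if the entry passes both quality criteria."""
    ups, downs = entry["ups"], entry["downs"]
    if ups <= min_ups:
        return False
    # Entries with no downvotes trivially satisfy the ratio criterion.
    return downs == 0 or ups / downs >= min_ratio

# Hypothetical entries:
entries = [
    {"word": "duckface", "ups": 150, "downs": 30},  # kept: 150 > 20, ratio 5
    {"word": "yeet", "ups": 15, "downs": 2},        # dropped: too few upvotes
    {"word": "gnarly", "ups": 40, "downs": 35},     # dropped: ratio below 2
]
filtered = [e for e in entries if keep_entry(e)]
```

The random subsampling to 100,000 entries would then be a single `random.sample` call on the filtered list.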

A.2 Training
We randomly split the data into 80% train and 20% test before training for 10 epochs with early stopping with patience 3. The batch size was set to 1 due to memory constraints. For the learning rate, we argue that since the initialized parameters should already provide a solution close to the optimum when evaluated on our dataset (our fine-tuning being the very same masked language modeling task that RoBERTa was originally trained on), the learning rate should be smaller. Thus, instead of picking the learning rate γ = 6 · 10^-4 as was done by Liu et al. (2019), we experiment with γ ∈ {10^-4, 10^-5, 10^-6, 10^-7}. Training was done using an NVIDIA GeForce GTX 1080 8GB GPU and took around 1 to 1.5 days per model.
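The early-stopping rule with patience 3 can be made concrete with a small sketch. The validation losses below are hypothetical; only the stopping logic matters.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-indexed epoch at which training stops, i.e., once the
    validation loss has not improved for `patience` consecutive epochs,
    or None if training runs to completion."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses over 10 epochs; the loss last improves
# at epoch 5, so training stops after three non-improving epochs.
losses = [0.9, 0.7, 0.65, 0.66, 0.64, 0.68, 0.69, 0.70, 0.5, 0.4]
```

Note that with this rule the late improvement at epochs 9-10 is never seen, which is the usual trade-off of patience-based stopping.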

B.1 Distribution-based Metrics
Method: In addition to the distance-based APD metric, we experiment with two distribution-based ones, namely entropy difference (ED) and Jensen-Shannon divergence (JSD) (Giulianelli et al., 2020). We assume a categorical distribution p_t over a set of K_w word senses for word w and time period t, where the word sense s_i^w of an occurrence i is given by the cluster assignment of its contextualized representation. Given the word sense distributions p_1 and p_2 of two time periods, we define the ED metric as ED(p_1, p_2) = H(p_2) - H(p_1), with entropy H(·). The JSD is given as JSD(p_1, p_2) = (1/2) KL(p_1 || M) + (1/2) KL(p_2 || M), where M = (1/2)(p_1 + p_2) and KL(·||·) is the KL divergence.
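A minimal sketch of the two distribution-based metrics, assuming the sense distributions are given as plain probability lists over the same sense inventory (the example distributions are hypothetical):

```python
import math

def entropy(p):
    """Shannon entropy H(p) of a categorical distribution, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL divergence KL(p || q)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def ed(p1, p2):
    """Entropy difference between two sense distributions."""
    return entropy(p2) - entropy(p1)

def jsd(p1, p2):
    """Jensen-Shannon divergence: average KL to the mixture M."""
    m = [(x + y) / 2 for x, y in zip(p1, p2)]
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

# A word dominated by one sense in 2010 vs. two balanced senses in 2020:
p_2010 = [0.9, 0.1]
p_2020 = [0.5, 0.5]
```

Unlike ED, which only registers a change in how evenly senses are used, the JSD is symmetric, bounded, and also detects a swap between equally entropic distributions.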
We obtain the word sense distributions via a clustering of the representations from both time periods. We experiment with K-Means and Gaussian Mixture Models (GMMs), the latter proposed for its ability to find more general cluster shapes. We also experiment briefly with Affinity Propagation, which has been used in previous semantic change detection work (Martinc et al., 2020; Kutuzov and Giulianelli, 2020; Montariol et al., 2021). However, we find it to be ill-suited for our purposes, since it produces an excessive number of clusters compared to how a human would classify word senses.
For both K-Means and GMM, we experiment with selecting the optimal K_w ∈ [1, 10] through two different procedures. The first is a slight extension of the method from Giulianelli et al. (2020): we select the K_w that optimizes the silhouette score (Rousseeuw, 1987) over a set of different initializations. Their approach does not consider the single-cluster case, however, so we extend it by setting K_w = 1 when the best silhouette score is below a threshold of 0.1. For K-Means, we further experiment with an automatic elbow method 6 for the sum of squared distances to the cluster centroids, which decreases monotonically with the number of clusters. We again select the cluster assignments with the largest silhouette score over multiple random initializations. For GMM, we further experiment with taking the model with the best Bayesian Information Criterion (Schwarz, 1978).
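The silhouette-based selection rule with the single-cluster fallback can be sketched as follows. The per-K silhouette scores would in practice come from a clustering library (e.g., scikit-learn); here they are hypothetical inputs.

```python
def select_num_clusters(silhouette_by_k, threshold=0.1):
    """Pick the number of senses K_w: the K with the best silhouette
    score, falling back to a single cluster when even the best score is
    below `threshold` (the K = 1 case has no silhouette score itself)."""
    best_k = max(silhouette_by_k, key=silhouette_by_k.get)
    if silhouette_by_k[best_k] < threshold:
        return 1
    return best_k

# Hypothetical silhouette scores for K in 2..4:
clear_senses = {2: 0.45, 3: 0.52, 4: 0.38}   # well-separated senses
no_structure = {2: 0.05, 3: 0.08, 4: 0.02}   # monosemous word
```

The threshold turns a purely relative criterion (which K is best) into an absolute one, letting the procedure declare that no multi-sense structure exists at all.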
Clustering examples: In Figure 7 we see three clusters found for "gag." They do not seem to correspond to word senses however: An example from the first cluster is "user i need a pic of you begging if i ' m boiling these because boiled eggs make me gag . :d," an example from the second cluster is "lmao rt user user user so i tried that tuna with cheese and my gag reflexes were in full affect !" and an example from the third cluster is "gag me with a spoon" -all seemingly referring to the sensation of being about to vomit.
We show another example in Figure 8 of the word "gnarly," this time reduced to 2 dimensions using UMAP. Gnarly has three meanings according to the Online Slang Dictionary: It can either mean very good / excellent / cool, gross / disgusting or painful / dangerous. These three word senses are not separated by UMAP and GMM, for instance both "its a good thing one of my roomies is a dude , who else would kill gnarly spiders in my room when i start to hyperventilate" and "rt user bro my wreck on the scooter was so gnarly like it was fun i love shit like that . i wish i could've been on jackass" are put in the first cluster.

B.2 Variance Explained by PCA components
Consider Figure 9 for example plots of how much variance is preserved with PCA on the contextualized representations.

B.3 Results
We further present more results of the experimentation on the SemEval-2020 Task 1 Subtask 2. All tables show the Spearman's rank-order correlation between the change metrics and the ground truths.
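For completeness, Spearman's rank-order correlation used throughout these tables is simply the Pearson correlation computed on the ranks of the two variables. A minimal stdlib sketch (the inputs below are hypothetical score lists, not SemEval data):

```python
def rankdata(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank-order correlation: Pearson on the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because it depends only on ranks, the metric rewards any monotone agreement between predicted change scores and the ground-truth ranking, regardless of the scores' scale.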
In Table 2 we compare our best performing setup to the three best performing previous approaches on SemEval-2020 Task 1 Subtask 2. We see that only Kutuzov and Giulianelli (2020) display a higher score, which might be partially explained by the fact that they fine-tune their model on the SemEval test corpora. We do not do this since our main goal is not to beat state-of-the-art on the shared task, but rather to find a good enough model to detect semantic change in slang.
The results comparing the layer representations can be observed in Table 3. As a side observation, we note that the less isotropic first-layer representations seem to perform better than the more isotropic last-layer representations. In Table 4 we present a comparison across different layer representations for both APD-based and distribution-based metrics. We observe that none of the distribution-based metrics give significant results, even when used with dimensionality reduction techniques. While a few of them do show a slight positive correlation, we omit this approach altogether. The APD results, on the other hand, show a high correlation for many of the configurations, indicating APD's robustness in detecting semantic change. We show a selection of these in Table 6.

C Appendix -Hybrid Words
We define hybrid words as words that have both a slang and a nonslang meaning, i.e., that occur in both the Online Slang Dictionary (OSD) and Merriam-Webster (MW). In this section, we compare the polysemy, semantic change, frequency shift and absolute frequency change patterns of hybrid words to those of slang and nonslang words. Polysemy is collected for hybrid words from OSD and MW separately. Since the MW dictionary may also contain slang meanings, we filter out definitions labeled as slang, informal or vulgar from these scores. The mean polysemy score of the slang words is 2.074 ± 2.568 and the mean OSD polysemy score of the hybrid words is 2.580 ± 2.178, a non-significant difference (p > 0.05) in distribution according to a permutation test. This tells us that we are not skewing the polysemy score distribution of the slang words by excluding hybrid words. As for the nonslang meanings of the hybrid words, we get a mean polysemy score of 6.880 ± 6.080, which is significantly different (p < 0.001) from that of the nonslang words (3.079 ± 2.780). This is an interesting observation, implying that had we included nonslang words with hybrid meaning in our nonslang word sample, the difference in polysemy between slang and nonslang words would have been larger. Some example words from this category with high MW polysemy scores include "split," "down" and "walk." For the relative frequency changes, we present the results as histograms in Figure 10. The frequency changes of hybrid words seem to fall between those of the slang words and the nonslang words, with a mean of −0.154 and a standard deviation of 0.608.
In addition, we compare the absolute relative frequency changes, as described in Section 5.3, across slang, nonslang and hybrid words. The histograms are presented in Figure 11. We observe, respectively, a mean and standard deviation of 1.246 & 1.180 for the slang words, 0.950 & 0.724 for the nonslang words and 0.482 & 0.402 for the hybrid words. The difference in mean is significant between the slang and nonslang words (p < 0.05), indicating that slang words have undergone a larger absolute change in frequency. Furthermore, we note a highly significant difference (p < 0.001) in the mean of the hybrid words compared to both the slang and nonslang word means.
Figure 10: Relative difference in frequency between 2020 and 2010, for slang, nonslang and hybrid words, where a positive score corresponds to an increase in frequency.
Figure 11: Absolute value of relative difference in frequency between 2020 and 2010, for slang, nonslang and hybrid words, where a larger score corresponds to a larger absolute change in frequency.
We compare the normalized semantic change scores between the slang, nonslang and hybrid words. Histograms over the semantic change scores are shown in Figure 12. We observe that the distribution of hybrid change scores again seems to be centered between the slang and nonslang distributions, with mean 0.621 ± 0.073. According to a permutation test, there is a significant difference in semantic change both between hybrid and slang words (p < 0.001) and between hybrid and nonslang words (p < 0.05).
Figure 12: Difference in semantic change score between 2010 and 2020 for slang, nonslang and hybrid words, where a larger score corresponds to a more pronounced semantic change.
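The two-sample permutation tests used for these comparisons can be sketched as follows: repeatedly shuffle the pooled group labels and count how often the permuted difference in means is at least as extreme as the observed one. The sample scores are hypothetical, not our actual data.

```python
import random

def permutation_test(xs, ys, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means. Returns the
    fraction of label permutations whose absolute mean difference is at
    least as large as the observed one (an estimated p-value)."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    n, hits = len(xs), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n
                   - sum(pooled[n:]) / (len(pooled) - n))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical change scores for two well-separated word groups:
group_a = [0.62, 0.65, 0.61, 0.64, 0.63, 0.66]
group_b = [0.70, 0.72, 0.69, 0.73, 0.71, 0.74]
```

Being nonparametric, the test makes no normality assumption about the score distributions, which is why it suits the skewed frequency-change histograms above.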

D.1 Preliminary on Constraint-based Causal Discovery
Assumptions Constraint-based causal discovery algorithms rely on two main assumptions, namely the global Markov assumption and the faithfulness assumption. The global Markov property (Peters et al., 2017) holds if all d-separations (defined below) encoded in the causal graph imply conditional independencies in the distribution over the variables contained in the graph. More formally, for a graph G = (V, E) and distribution P over the variables X_V, it holds for any disjoint subsets A, B and C of V that X_A ⊥_d X_B | X_C implies X_A ⊥⊥ X_B | X_C in P. The faithfulness assumption states the converse of the global Markov assumption: all conditional independencies in the distribution are encoded by d-separations in the graph.
d-separation Two nodes A, B ∈ V are said to be d-separated (Geiger et al., 1990) by a set of nodes Z ⊆ V if for all paths between A and B, at least one of the following holds:
• The path contains a directed chain A ··· → C → ··· B or A ··· ← C ← ··· B such that C ∈ Z.
• The path contains a fork A ··· ← C → ··· B such that C ∈ Z.
• The path contains a collider A ··· → C ← ··· B such that C ∉ Z and C' ∉ Z for all C' ∈ desc(C) (i.e., neither C nor any of its descendants is in Z).
We then denote X_A ⊥_d X_B | X_Z.
Markov Equivalence Constraint-based algorithms use conditional independence tests in order to identify a Markov equivalence class of DAGs. Two DAGs are defined to be Markov equivalent if they have the same skeleton (edges omitting direction) and v-structures. The three vertices A, B and C form a v-structure if A → B ← C and A and C are not directly connected by an edge. Alternatively, two DAGs are Markov equivalent if they describe the same set of d-separation relationships. A Markov equivalence class is the set of all Markov equivalent DAGs.
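The skeleton-and-v-structure characterization of Markov equivalence is mechanical enough to sketch directly. DAGs are represented as sets of (parent, child) pairs; the example graphs are the standard chain/collider illustration, not our discovered causal graph.

```python
def skeleton(edges):
    """Undirected edge set of a DAG given as (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All v-structures A -> B <- C where A and C are not adjacent."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(edges)
    vs = set()
    for b, ps in parents.items():
        for a in ps:
            for c in ps:
                # a < c avoids listing each unordered pair twice
                if a < c and frozenset((a, c)) not in skel:
                    vs.add((a, b, c))
    return vs

def markov_equivalent(e1, e2):
    """Two DAGs are Markov equivalent iff they have the same skeleton
    and the same v-structures."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

# X -> Y -> Z and X <- Y <- Z share a skeleton and have no v-structures,
# so they are Markov equivalent; X -> Y <- Z has a v-structure and is not.
chain1 = {("X", "Y"), ("Y", "Z")}
chain2 = {("Y", "X"), ("Z", "Y")}
collider = {("X", "Y"), ("Z", "Y")}
```

This is exactly why constraint-based algorithms return an equivalence class rather than a single DAG: the two chains above imply identical conditional independencies and cannot be distinguished from observational data alone.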