Dynamic Contextualized Word Embeddings

Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.


Introduction
Over the last decade, word embeddings have revolutionized the field of NLP. Traditional methods such as LSA (Deerwester et al., 1990), word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017) compute static word embeddings, i.e., they represent words as a single vector. From a theoretical standpoint, this way of modeling lexical semantics is problematic since it ignores the variability of word meaning in different linguistic contexts (e.g., polysemy) as well as different extralinguistic contexts (e.g., temporal and social variation).
The first shortcoming was addressed by the introduction of contextualized word embeddings that represent words as vectors varying across linguistic contexts. This allows them to capture more complex characteristics of word meaning, including polysemy. Contextualized word embeddings are widely used in NLP, constituting the semantic backbone of pretrained language models (PLMs) such as ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2020), and T5 (Raffel et al., 2020).
A concurrent line of work focused on the second shortcoming of static word embeddings, resulting in various types of dynamic word embeddings. Dynamic word embeddings represent words as vectors varying across extralinguistic contexts, in particular time (e.g., Rudolph and Blei, 2018) and social space (e.g., Zeng et al., 2018).
In this paper, we introduce dynamic contextualized word embeddings that combine the strengths of contextualized word embeddings with the flexibility of dynamic word embeddings. Dynamic contextualized word embeddings mark a departure from existing contextualized word embeddings (which are not dynamic) as well as existing dynamic word embeddings (which are not contextualized). Furthermore, as opposed to all existing dynamic word embedding types, they represent time and social space jointly. While our general framework for training dynamic contextualized word embeddings is model-agnostic (Figure 1), we present a version using a PLM (BERT) as the contextualizer, which allows for an easy integration within existing architectures. Dynamic contextualized word embeddings can serve as an analytical tool (e.g., to track the emergence and spread of semantic changes in online communities) or be employed for downstream tasks (e.g., to build temporally and socially aware text classification models), making them beneficial for various areas in NLP that face semantic variability. We illustrate application scenarios by performing exploratory experiments on English data from ArXiv, Ciao, Reddit, and YELP.
Contributions. We introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a PLM, dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks. We showcase potential applications by means of qualitative and quantitative analyses.

Related Work

Contextualized Word Embeddings
The distinction between the non-contextualized core meaning of a word and the senses that are realized in specific linguistic contexts lies at the heart of lexical-semantic scholarship (Geeraerts, 2010), going back to at least Paul (1880). In NLP, this is reflected by contextualized word embeddings that map type-level representations to token-level representations as a function of the linguistic context (McCann et al., 2017). As part of PLMs (Peters et al., 2018a; Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Clark et al., 2020; Raffel et al., 2020), contextualized word embeddings have led to substantial performance gains on a variety of tasks compared to static word embeddings that only have type-level representations (Deerwester et al., 1990; Mikolov et al., 2013a,b; Pennington et al., 2014; Bojanowski et al., 2017).
Sociolinguistics has shown that temporal and social variation in language are tightly interwoven: innovations such as a new word sense in the case of lexical semantics spread through the language community along social ties (Milroy, 1980, 1992; Labov, 2001; Pierrehumbert, 2012). However, most proposed dynamic word embedding types cannot capture more than one dimension of variation. Recently, a few studies have taken first steps in this direction by using genre information within a Bayesian model of semantic change (Frermann and Lapata, 2016; Perrone et al., 2019) and including social variables in training diachronic word embeddings (Jawahar and Seddah, 2019). In addition, to capture the full range of lexical-semantic variability, dynamic word embeddings should also be contextualized. Crucially, while contextualized word embeddings have been used to investigate semantic change (Giulianelli, 2019; Hu et al., 2019; Giulianelli et al., 2020; Kutuzov and Giulianelli, 2020; Martinc et al., 2020a,b), the word embeddings employed in these studies are not dynamic, i.e., they represent a word in a specific linguistic context by the same contextualized word embedding independent of extralinguistic context, or are fit to different time periods as separate models. (Interestingly, contextualized word embeddings have so far performed worse than non-contextualized word embeddings on the task of lexical semantic change detection; Kaiser et al., 2020.)

Model Overview
Given a sequence of words X = x^{(1)}, . . . , x^{(K)} and corresponding non-contextualized embeddings E = e^{(1)}, . . . , e^{(K)}, contextualizing language models compute the contextualized embedding of a particular word x^{(k)}, h^{(k)}, as a function c of its non-contextualized embedding, e^{(k)}, and the non-contextualized embeddings of words in the left context X^{(<k)} and the right context X^{(>k)} (some contextualizing language models such as GPT-2 (Radford et al., 2019) only operate on X^{(<k)}),

h^{(k)} = c\left(e^{(k)}, E^{(<k)}, E^{(>k)}\right). \quad (1)

Crucially, while h^{(k)} is a token-level representation, e^{(k)} is a type-level representation and is modeled as a simple embedding look-up. Here, in order to take the variability of word meaning in different extralinguistic contexts into account, we depart from this practice and model e^{(k)} as a function d that depends not only on the identity of x^{(k)} but also on the social context s_i and the temporal context t_j in which the sequence X occurred,

e^{(k)}_{ij} = d\left(x^{(k)}, s_i, t_j\right). \quad (2)

Figure 2: Model architecture. Words are mapped to dynamic embeddings by the dynamic component, which are then contextualized by the contextualizer. The output of the contextualizer is used to compute the task-specific loss L_task.
Dynamic contextualized word embeddings are hence computed in two stages: words are first mapped to dynamic type-level representations by d and then to contextualized token-level representations by c (Figures 1 and 2). This two-stage structure follows work in cognitive science and linguistics that indicates that extralinguistic information is processed before linguistic information by human speakers (Hay et al., 2006). Since many words in the core vocabulary are semantically stable across social and temporal contexts, we place a Gaussian prior on e^{(k)}_{ij},

e^{(k)}_{ij} \sim \mathcal{N}\left(\tilde{e}^{(k)}, \lambda_a^{-1} I\right), \quad (3)

where ẽ^{(k)} denotes a non-dynamic representation of x^{(k)}. Combining Equations 2 and 3, we write the function d as

e^{(k)}_{ij} = d\left(x^{(k)}, s_i, t_j\right) = \tilde{e}^{(k)} + o^{(k)}_{ij}, \quad (4)

where o^{(k)}_{ij} denotes the vector offset from x^{(k)}'s non-dynamic embedding ẽ^{(k)}, which is stable across social and temporal contexts, to its dynamic embedding e^{(k)}_{ij}, which is specific to s_i and t_j. The distribution of o^{(k)}_{ij} then follows from Equation 3,

o^{(k)}_{ij} \sim \mathcal{N}\left(0, \lambda_a^{-1} I\right). \quad (5)

We enforce Equation 5 by including a regularization term in the objective function (Section 3.4).

Contextualizing Component
We leverage a PLM for the function c, specifically BERT (Devlin et al., 2019). Denoting with E_{ij} the sequence of dynamic embeddings corresponding to X in s_i and t_j, the dynamic version of Equation 1 becomes

h^{(k)}_{ij} = c\left(e^{(k)}_{ij}, E^{(<k)}_{ij}, E^{(>k)}_{ij}\right). \quad (6)

We also use BERT, specifically its pretrained input embeddings, to initialize the non-dynamic embeddings ẽ^{(k)}, which are summed with the vector offsets o^{(k)}_{ij} (Equation 4) and fed into BERT. Using a PLM for c has the advantage of making it easy to employ dynamic contextualized word embeddings for downstream tasks by adding a task-specific layer on top of the PLM.
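For concreteness, the following sketch shows one way to feed such dynamic input embeddings to a pretrained BERT via the inputs_embeds argument of the Hugging Face transformers library; the function and the way offsets are passed in are our own illustration, not the authors' released code.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def contextualize(texts, offsets):
    """texts: list of strings; offsets: tensor (batch, seq_len, 768) with the
    dynamic offsets o_ij for each token (zero for tokens that receive no
    dynamic embedding, e.g., special tokens or infrequent words)."""
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # non-dynamic embeddings ẽ: BERT's pretrained input embedding look-up
    e_tilde = bert.embeddings.word_embeddings(enc["input_ids"])
    # dynamic embeddings e_ij = ẽ + o_ij, fed to the encoder instead of token ids
    out = bert(inputs_embeds=e_tilde + offsets,
               attention_mask=enc["attention_mask"])
    return out.last_hidden_state  # contextualized representations h_ij
```

Position and segment embeddings are still added internally by BERT when inputs_embeds is supplied, so only the word embedding look-up is replaced by the dynamic embeddings.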

Dynamic Component
We model the vector offset o^{(k)}_{ij} as a function of the word x^{(k)}, which we represent by its non-dynamic embedding ẽ^{(k)}, as well as the social context s_i, which we represent by a time-specific embedding s_{ij}. We use BERT's pretrained input embeddings for ẽ^{(k)} (we also tried to learn separate embeddings in the dynamic component, but this led to worse performance). We combine these representations in a time-specific feed-forward network,

o^{(k)}_{ij} = \text{FFN}_j\left(\tilde{e}^{(k)} \oplus s_{ij}\right), \quad (7)

where ⊕ denotes concatenation. To compute the social embedding s_{ij}, we follow common practice in the computational social sciences and represent the social community as a graph G = (S, E), where S is the set of social units s_i, and E is the set of edges between them (Section 4). We use a time-specific graph attention network (GAT) as proposed by Veličković et al. (2018) to encode G,

s_{ij} = \text{GAT}_j\left(\tilde{s}_i, G\right), \quad (8)

where s̃_i is a non-dynamic representation of the social unit s_i that we initialize with node2vec (Grover and Leskovec, 2016) embeddings (we also tried a model with a feed-forward network instead of graph attention, but it consistently performed worse).
To model the temporal drift of the dynamic embeddings e^{(k)}_{ij}, we follow previous work on dynamic word embeddings (Bamler and Mandt, 2017; Rudolph and Blei, 2018) and impose a random walk prior over o^{(k)}_{ij},

o^{(k)}_{ij} \sim \mathcal{N}\left(o^{(k)}_{ij'}, \lambda_w^{-1} I\right), \quad (9)

with j' = j − 1. This type of Gaussian process is known as the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930) and is commonly used to model time series (Roberts et al., 2013). The random walk prior enforces that the dynamic embeddings e^{(k)}_{ij} change smoothly over time.
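A minimal PyTorch sketch of Equations 7 and 8, assuming torch_geometric's GATConv for the graph attention layers; the two-layer structure, head count, and dimensionalities follow Appendix A.2, while module and variable names are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # assumed GAT implementation

class SocialEncoder(nn.Module):
    """Time-specific GAT over the social graph (Equation 8)."""
    def __init__(self, soc_dim=50, heads=4):
        super().__init__()
        self.gat1 = GATConv(soc_dim, soc_dim, heads=heads, concat=False)
        self.gat2 = GATConv(soc_dim, soc_dim, heads=heads, concat=False)

    def forward(self, s_tilde, edge_index):
        # s_tilde: (n_units, soc_dim) node2vec-initialized node features
        h = torch.tanh(self.gat1(s_tilde, edge_index))
        return torch.tanh(self.gat2(h, edge_index))  # s_ij for every node

class OffsetComponent(nn.Module):
    """Time-specific feed-forward network computing o_ij (Equation 7)."""
    def __init__(self, emb_dim=768, soc_dim=50):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim + soc_dim, emb_dim), nn.Tanh(),
            nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, e_tilde, s_ij):
        # e_tilde: (batch, emb_dim); s_ij: (batch, soc_dim)
        return self.ffn(torch.cat([e_tilde, s_ij], dim=-1))
```

One SocialEncoder and one OffsetComponent would be instantiated per time slice j to obtain the time-specific functions GAT_j and FFN_j.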

Model Training
The combination with BERT makes dynamic contextualized word embeddings easily applicable to different tasks by adding a task-specific layer on top of the contextualizing component. For training the model, the overall loss is

L = L_{\text{task}} + L^{a}_{\text{prior}} + L^{w}_{\text{prior}}, \quad (10)

where L_task is the task-specific loss, and L^a_prior and L^w_prior are the regularization terms that impose the anchoring and random walk priors on the type-level offset vectors,

L^{a}_{\text{prior}} = \lambda_a \sum_{i,j,k} \left\| o^{(k)}_{ij} \right\|_2^2, \quad (11)

L^{w}_{\text{prior}} = \lambda_w \sum_{i,j,k} \left\| o^{(k)}_{ij} - o^{(k)}_{ij'} \right\|_2^2. \quad (12)

It is common practice to set λ_a ≪ λ_w (Bamler and Mandt, 2017; Rudolph and Blei, 2018). Here, we set λ_a = 10^{-3} · λ_w, which reduces the number of tunable hyperparameters. We place the priors only on frequent words in the vocabulary (Section 5.1), taking into account the observation that the vocabulary core constitutes the best basis for dynamic word embeddings (Hamilton et al., 2016b).
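As a sketch of Equations 10-12, the two prior terms can be computed as follows, assuming the offsets of the regularized vocabulary are collected in a single tensor with the time dimension first (this bookkeeping is an assumption for illustration):

```python
import torch

def prior_losses(offsets, lambda_w, lambda_a=None):
    """offsets: tensor (n_times, n_units, n_words, dim) holding o_ij for the
    frequent vocabulary. Returns the anchoring and random walk penalties
    (Equations 11 and 12)."""
    if lambda_a is None:
        lambda_a = 1e-3 * lambda_w  # reduces the number of tunable hyperparameters
    l_anchor = lambda_a * offsets.pow(2).sum()
    # squared differences between offsets at consecutive time slices
    l_walk = lambda_w * (offsets[1:] - offsets[:-1]).pow(2).sum()
    return l_anchor, l_walk

# total loss: task loss plus both priors (Equation 10)
# loss = task_loss + sum(prior_losses(offsets, lambda_w=0.1))
```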

Data
We fit dynamic contextualized word embeddings to four datasets with different linguistic, social, and temporal characteristics, which allows us to investigate factors impacting their utility. Each dataset D consists of a set of texts (e.g., reviews) written by a set of social units S (e.g., users) over a sequence of time periods T (e.g., years). Furthermore, the social units are connected by a set of edges E within a social network G. Table 1 provides summary statistics of the four datasets.
Table 1: Dataset statistics. |D|: number of data points; μ_{|X|}: average number of tokens per text; |S|: number of nodes in the network; |E|: number of edges; μ_d: average node degree; μ_π: average shortest path length between two nodes; ρ: network density; |T|: number of time points; t_1: first time point; t_{|T|}: last time point. In cases where years are the temporal unit, we also provide the first and last month included in the data.
ArXiv. ArXiv is an open-access distribution service for scientific articles. Recently, a dataset of all papers published on ArXiv with corresponding metadata was released. For this study, we use ArXiv's subject classes (e.g., cs.CL) as social units and extract the abstracts of papers published between 2001 and 2020 for subjects with at least 100 publications in that time (subject class combinations passing the frequency threshold, e.g., cs.CL&cs.AI, are treated as individual units). To create the network, we measure the overlap in authors between subject classes as the Jaccard similarity of corresponding author sets, resulting in a similarity matrix S. Based on S, we define the adjacency matrix A of G, whose elements are A_{ij} = 1 if S_{ij} > θ and A_{ij} = 0 otherwise, i.e., there is an edge between subject classes i and j if the Jaccard similarity of their author sets is greater than θ. We set θ to 0.01 (we tried other values of θ, but the results were similar).
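For illustration, the subject-class network could be constructed along these lines (the data structures are assumed, not the authors' code):

```python
import itertools
import networkx as nx

def build_subject_graph(author_sets, theta=0.01):
    """author_sets: dict mapping a subject class to its set of authors.
    Adds an edge whenever the Jaccard similarity of two author sets exceeds theta."""
    G = nx.Graph()
    G.add_nodes_from(author_sets)
    for a, b in itertools.combinations(author_sets, 2):
        intersection = len(author_sets[a] & author_sets[b])
        union = len(author_sets[a] | author_sets[b])
        if union and intersection / union > theta:
            G.add_edge(a, b)
    return G
```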
Ciao. Ciao is a product review site on which users can mark explicit trust relations towards other users (e.g., if they find their reviews helpful). A dataset containing reviews covering the time period from 2000 to 2011 has been made publicly available (Tang et al., 2012; https://www.cse.msu.edu/~tangjili/trust.html). We use the trust relations to create a directed graph. Since we also perform sentiment analysis on the dataset, we follow Yang and Eisenstein (2017) in converting the five-star rating range into two classes by discarding three-star reviews and treating four/five stars as positive and one/two stars as negative.
Reddit. Reddit is a social media platform hosting discussions about a variety of topics. It is divided into smaller communities, so-called subreddits, which have been shown to be highly conducive to linguistic dynamics (del Tredici and Fernández, 2018; del Tredici et al., 2019a). A full dump of public Reddit posts is available online (https://files.pushshift.io/reddit/comments). We retrieve all comments between September 2019 and April 2020, which allows us to examine the effects of the rising Covid-19 pandemic on lexical usage patterns. We remove subreddits with fewer than 10,000 comments in the examined time period and sample 20 comments per subreddit and month. For each subreddit, we compute the set of users with at least 10 comments in the examined time period. Based on this, we use the same strategy as for ArXiv to create a network based on user overlap.
YELP. Similarly to Ciao, YELP is a product review site on which users can mark explicit friendship relations. A subset of the data has been released online. We use the friendship relations to create a directed graph between users. Since we also use the dataset for sentiment analysis, we again discard three-star reviews and convert the five-star rating range into two classes.
The fact that the datasets differ in terms of their social and temporal characteristics allows us to examine which factors impact the utility of dynamic contextualized word embeddings. We highlight, e.g., that the datasets differ in the nature of their social units, cover different time periods, and exhibit different levels of temporal granularity. We randomly split all datasets into 70% training, 10% development, and 20% test. We apply stratified sampling to make sure the model sees data from all time points during training. See Appendix A.1 for details about data preprocessing.
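The stratified split could be implemented, e.g., with scikit-learn; the DataFrame and the name of the time column are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split

def split_dataset(df, seed=0):
    """70/10/20 split, stratified by time point so that all time slices
    are seen during training ('time' is an assumed column name)."""
    train, rest = train_test_split(df, test_size=0.3,
                                   stratify=df["time"], random_state=seed)
    dev, test = train_test_split(rest, test_size=2 / 3,
                                 stratify=rest["time"], random_state=seed)
    return train, dev, test
```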

Embedding Training
We fit dynamic contextualized word embeddings to all four datasets, using BERT BASE (uncased) as the contextualizer and masked language modeling as the training objective (Devlin et al., 2019), i.e., we add a language modeling head on top of BERT. For a given dataset, we only compute dynamic embeddings for tokens in BERT's input vocabulary that are among the 100,000 most frequent words; for less frequent tokens, we input the non-dynamic BERT embedding. To estimate the goodness of fit, we measure masked language modeling perplexity and compare against finetuned (non-dynamic) contextualized word embeddings, specifically BERT BASE (uncased). See Appendix A.2 for details about implementation, hyperparameter tuning, and runtime. Dynamic contextualized word embeddings (DCWE) yield fits to the data similar to and (sometimes significantly) better than non-dynamic contextualized word embeddings (CWE), which indicates that they successfully combine extralinguistic with linguistic information (Table 2). Statistical significance is tested with a Wilcoxon signed-rank test (Wilcoxon, 1945; Dror et al., 2018).

Ablation Study
To examine the relative importance of temporal and social information for dynamic contextualized word embeddings, we perform two experiments in which we ablate social context and time (Figure 3). In social ablation (SA), we train dynamic contextualized word embeddings where the vector offset depends only on word identity and time, not social context, keeping the random walk prior between subsequent time slices. In temporal ablation (TA), we use one social component for all time slices. See Appendix A.3 for details about implementation, hyperparameter tuning, and runtime.
Temporal ablation has more severe consequences than social ablation (Table 3). On Ciao, the social component does not yield better fits on the data at all, which might be related to the fact that many users in this dataset only have one review, and that its social network has the lowest density as well as the smallest average node degree out of all considered datasets (Table 1).

Qualitative Analysis
Do dynamic contextualized word embeddings indeed capture interpretable dynamics in word meaning? To examine this question qualitatively, we define sim^{(k)}_{ij} as the cosine similarity between the non-dynamic embedding of x^{(k)}, ẽ^{(k)}, and the dynamic embedding of x^{(k)} given social and temporal contexts s_i and t_j, e^{(k)}_{ij},

\text{sim}^{(k)}_{ij} = \cos\left(\varphi_{ij}\right),

where φ_{ij} is the angle between ẽ^{(k)} and e^{(k)}_{ij} (Figure 1). To find words with a high degree of variability, we compute σ^{(k)}_{sim}, the standard deviation of sim^{(k)}_{ij} based on all s_i and t_j in which a given word x^{(k)} occurs in the data, where we take the development set for D. Looking at the top-ranked words according to σ^{(k)}_{sim}, we observe that they exhibit pronounced extralinguistically-driven semantic dynamics in the data. For Reddit, e.g., many of the top-ranked words have experienced a sudden shift in their dominant sense during the Covid-19 pandemic, such as "isolating" and "testing" (e.g., "Testing is not required if a patient has no symptoms, mild symptoms, or is a returning traveller and is isolating at home."). The social and temporal contexts in which the pandemic-related sense is dominant have a lower sim^{(k)}_{ij} (i.e., the cosine distance is larger) than the ones in which the more general sense is dominant. Such short-term semantic shifts, which have attracted growing interest in NLP recently (Stewart et al., 2017; del Tredici et al., 2019a; Powell and Sentz, 2020), can result in lasting semantic narrowing if speakers become reluctant to use the word outside of the more specialized sense (Anttila, 1989; Croft, 2000; Robinson, 2012; Bybee, 2015).
Thus, the qualitative analysis suggests that the dynamic component indeed captures extralinguistically-driven variability in word meaning. In Sections 5.4 and 5.5, we will demonstrate by means of two example applications how this property can be beneficial in practice.
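For reference, a minimal sketch of the similarity and variability computation described above (the data structures are assumed):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between a non-dynamic and a dynamic embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def variability_ranking(sims_by_word):
    """sims_by_word: dict mapping a word to the list of sim values observed
    across all social and temporal contexts in the development set.
    Returns the words sorted by the standard deviation of these values."""
    sigma = {w: float(np.std(v)) for w, v in sims_by_word.items() if len(v) > 1}
    return sorted(sigma, key=sigma.get, reverse=True)
```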

Exploration 1: Semantic Diffusion
We will now provide a more in-depth analysis of social and temporal dynamics in word meaning to showcase the potential of dynamic contextualized word embeddings as an analytical tool. Specifically, we will analyze how changes in the dominant sense of a word diffuse through the social networks of ArXiv and Reddit. For ArXiv, we will examine the deep learning sense of the word "network". For Reddit, we will focus on the medical sense of the word "mask". We know that these senses have become more widespread over the last few years (ArXiv) and months (Reddit), but we want to test if dynamic contextualized word embeddings can capture this spread, and if they allow us to gain new insights about the spread of semantic associations through social networks in general.
To perform this analysis, let r^{(k,k')}_{ij} be the rank of x^{(k')}'s embedding among the N nearest neighbors of x^{(k)}'s embedding, given social and temporal contexts s_i and t_j. Based on this rank, we define r̂^{(k,k')}_{ij}, a semantic similarity score between x^{(k)} and x^{(k')} that ranges from 0 (no similarity) to 100 (very similar); r̂^{(k,k')}_{ij} is maximal when x^{(k')}'s embedding is closest to x^{(k)}'s embedding. Using r̂^{(k,k')}_{ij}, we measure dynamics in the semantic similarity between "network" and "learning" (representing the deep learning sense of "network") as well as "mask" and "vaccine" (representing the medical sense of "mask"). For all social and temporal contexts in which "network" and "mask" occur, we compute r̂^{(k,k')}_{ij} between their socially and temporally dynamic embeddings on the one hand and time-specific centroids of "learning" and "vaccine" averaged over social contexts on the other, employing contextualized versions of the dynamic embeddings. In cases where "network" or "mask" occur more than once in a certain social and temporal context, we take the mean of r̂^{(k,k')}_{ij}.
The dynamics of r̂^{(k,k')}_{ij} reflect how the changes in the dominant sense of "network" and "mask" spread through the social networks (Figure 4). For "network", we see that the deep learning sense was already present in computer science and physics in 2013, where neural networks have been used since the 1980s. It then gradually spread from these two epicenters, with a major intensification after 2016. For "mask", we also see a gradual diffusion, with a major intensification after 03/2020.
Figure 4: r̂^{(k,k')}_{ij}, a score for semantic similarity between 0 (no similarity) and 100 (very similar), for "network" and "learning" in ArXiv as well as "mask" and "vaccine" in Reddit. The different node shapes in the ArXiv network represent the three major ArXiv subject classes: computer science (square), mathematics (triangle), and physics (circle). For "network", the change towards the deep learning sense spread gradually from computer science and physics. For "mask", the change towards the medical sense also spread gradually, with a major intensification after 03/2020.
On what paths do new semantic associations spread through the social network? In complex systems theory, there are two basic types of random motion on networks: random walks, which consist of a series of consecutive random steps, and random flights, where step lengths are drawn from the Lévy distribution (Masuda et al., 2017). To probe whether there is a dominant type of spread for the two examples, we compute for each time slice t_j what proportion of nodes that have r̂^{(k,k')}_{ij} > 0 for the first time at t_j (i.e., the change in the dominant sense has just arrived) are neighbors of nodes that already had r̂^{(k,k')}_{ij} > 0 before t_j. This analysis shows that random walks are the dominant type of spread for "network", but random flights for "mask" (Figure 5). Intuitively, it makes sense that a technical concept such as neural networks spreads through the direct contact of collaborating scientists rather than through more distant forms of reception (e.g., the reading of articles). In the case of facial masks, on the other hand, the exogenous factor of the worsening Covid-19 pandemic and the accompanying publicity was a driver of semantic dynamics irrespective of node position.
Figure 5: Types of semantic diffusion in ArXiv (A) and Reddit (R). The figure shows for each time t_j the probability that a node having the new sense for the first time is the neighbor of a node that already had it previously (walk, W) as opposed to cases where none of its neighbors had it previously (flight, F).
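A minimal sketch of this neighborhood check, assuming a networkx graph over the social units and a mapping from each node to the first time slice at which the new sense arrives (both names are illustrative):

```python
def walk_fraction_per_time(G, arrival):
    """G: networkx graph over social units; arrival: dict mapping a node to the
    first time slice at which r-hat > 0 (nodes where the sense never arrives
    are simply absent). For each time slice, returns the fraction of newly
    arriving nodes with at least one neighbor where the sense had already
    arrived (walk) as opposed to none (flight)."""
    fractions = {}
    for t in sorted(set(arrival.values())):
        new_nodes = [n for n, a in arrival.items() if a == t]
        if not new_nodes:
            continue
        walks = sum(
            any(m in arrival and arrival[m] < t for m in G.neighbors(n))
            for n in new_nodes
        )
        fractions[t] = walks / len(new_nodes)
    return fractions
```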

Exploration 2: Sentiment Analysis
As a second testbed, we apply dynamic contextualized word embeddings on a task for which social and temporal information is known to be important (Yang and Eisenstein, 2017): sentiment analysis. We use the Ciao and YELP datasets and train dynamic contextualized word embeddings by adding a two-layer feed-forward network on top of BERT BASE (uncased) and finetuning it for the task of sentiment classification. We again compare against finetuned non-dynamic contextualized word embeddings (see Appendix A.4 for details). Dynamic contextualized word embeddings achieve slight but significant improvements over the already strong performance of non-dynamic BERT (Table 5). This provides further evidence that infusing social and temporal information on the lexical level can be useful for NLP tasks.
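For illustration, a sketch of the classification setup with a two-layer feed-forward head on BERT's [CLS] representation; the hidden dimensionality of 100 follows Appendix A.4, while the activation function and the use of the [CLS] token are assumptions on our part.

```python
import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
    """BERT with a two-layer feed-forward head for binary sentiment classification."""
    def __init__(self, hidden=100):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(768, hidden), nn.Tanh(),  # activation is an assumption
            nn.Linear(hidden, 1)                # one logit for binary cross-entropy
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] representation
```

The dynamic variant would additionally add the learned offsets to BERT's input embeddings, as sketched in the contextualizing component above.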

Conclusion
We have introduced dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a PLM, specifically BERT, dynamic contextualized word embeddings model time and social space jointly, which makes them advantageous for various areas in NLP. We have trained dynamic contextualized word embeddings on four datasets and shown that they are capable of tracking social and temporal variability in word meaning. Besides serving as an analytical tool, dynamic contextualized word embeddings can also be of benefit for downstream tasks such as sentiment analysis.

A.1 Data Preprocessing
For each dataset, we remove duplicates as well as texts with fewer than 10 words. For the Ciao dataset, we further remove reviews rated as not helpful. We lowercase all words. Since BERT's input is limited to 512 tokens, we truncate longer texts by taking the first and last 256 tokens.
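A small sketch of this truncation, operating on an already tokenized sequence (illustrative; special tokens are not accounted for here):

```python
def truncate(tokens, max_len=512):
    """Keep the first and last max_len // 2 tokens of over-long texts."""
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[-half:]
```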

A.2 Embedding Training: Hyperparameters
DCWE. The hyperparameters of the contextualizer are as for BERT BASE (uncased). In particular, the dimensionality of the input embeddings ẽ^{(k)} is 768. For the dynamic component, the social vectors s_{ij} and s̃_i have a dimensionality of 50. The node2vec vectors for the initialization of s̃_i are trained on 10 sampled walks of length 80 per node with a window size of 2. The GAT has two layers with four attention heads each (activation function: tanh). The feed-forward network has two layers (activation function: tanh). We apply dropout. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 7}, the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}, and the regularization constant λ_a ∈ {1 × 10^{-2}, 1 × 10^{-1}}, thereby also determining λ_w (Section 3.4).
CWE. All hyperparameters are as for BERT BASE (uncased). The number of trainable parameters is 110,104,890. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 7} and the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}.
For both DCWE and CWE, we tune hyperparameters except for the number of epochs on the Ciao dataset (selection criterion: masked language modeling perplexity) and use the best configuration for ArXiv, Reddit, and YELP. Models are trained with categorical cross-entropy as the loss function and Adam (Kingma and Ba, 2015) as the optimizer. Experiments are performed on a GeForce GTX 1080 Ti GPU (11GB). Table 6 lists statistics of the validation performance over hyperparameter search trials and provides information about best hyperparameter configurations (since expected validation performance (Dodge et al., 2019) may not be correct for grid search, we report the mean and standard deviation of the performance instead). We also report the number of hyperparameter search trials as well as runtimes for the hyperparameter search.

A.3 Ablation Study: Hyperparameters
SA. Words are mapped to offsets using time-specific two-layer feed-forward networks (activation function: tanh). Both layers have a dimensionality of 768. All other hyperparameters are as for DCWE with a full dynamic component (Appendix A.2). The number of trainable parameters again varies between models trained on different datasets due to differences in |T| and is 133,728,570 for ArXiv, 124,279,098 for Ciao, 119,554,362 for Reddit, and 121,916,730 for YELP. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 7}, the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}, and the regularization constant λ_a ∈ {1 × 10^{-2}, 1 × 10^{-1}}, thereby also determining λ_w (Section 3.4).
TA. All hyperparameters are as for DCWE with a full dynamic component (Appendix A.2), with the difference that we only use one social component (consisting of a two-layer GAT and a two-layer feed-forward network) for all time units. The number of trainable parameters is 111,345,374. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 7}, the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}, and the regularization constant λ_a ∈ {1 × 10^{-2}, 1 × 10^{-1}}.
For both SA and TA, we tune hyperparameters except for the number of epochs on the Ciao dataset (selection criterion: masked language modeling perplexity) and use the best configuration for ArXiv, Reddit, and YELP. Models are trained with categorical cross-entropy as the loss function and Adam as the optimizer. Experiments are performed on a GeForce GTX 1080 Ti GPU (11GB). Table 7 lists statistics of the validation performance over hyperparameter search trials and provides information about best hyperparameter configurations. We also report the number of hyperparameter search trials as well as runtimes for the hyperparameter search.
Table 7: Validation performance statistics and hyperparameter search details for the ablation study. SA: social ablation; TA: temporal ablation. The table shows the mean (μ) and standard deviation (σ) of the validation performance (masked language modeling perplexity) on all hyperparameter search trials and gives the number of epochs (n_e), learning rate (l), and regularization constant (λ_a) with the best validation performance as well as the runtime (τ) in minutes for one full hyperparameter search (28 trials on Ciao, 7 trials on ArXiv, Reddit, and YELP).

A.4 Sentiment Analysis: Hyperparameters
DCWE. The mid layer of the feed-forward network on top of BERT has a dimensionality of 100. All other hyperparameters are as for DCWE trained on masked language modeling (Appendix A.2). The number of trainable parameters again varies between models trained on different datasets due to differences in |T| and is 124,445,049 for Ciao and 121,964,081 for YELP. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 5}, the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}, and the regularization constant λ_a ∈ {1 × 10^{-2}, 1 × 10^{-1}}, thereby also determining λ_w (Section 3.4).
CWE. The mid layer of the feed-forward network on top of BERT has a dimensionality of 100. All other hyperparameters are as for BERT BASE (uncased). The number of trainable parameters is 109,559,241. We use a batch size of 4 and perform grid search for the number of epochs n_e ∈ {1, . . . , 5} and the learning rate l ∈ {1 × 10^{-6}, 3 × 10^{-6}}.
For both DCWE and CWE, we tune hyperparameters except for the number of epochs on the Ciao dataset (selection criterion: F1 score) and use the best configuration for YELP. Models are trained with binary cross-entropy as the loss function and Adam as the optimizer. Experiments are performed on a GeForce GTX 1080 Ti GPU (11GB). Table 8 lists statistics of the validation performance over hyperparameter search trials and provides information about best hyperparameter configurations. We also report the number of hyperparameter search trials as well as runtimes for the hyperparameter search.