CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals

We propose a framework to modularize the training of neural language models that use diverse forms of context by eliminating the need to jointly train context and within-sentence encoders. Our approach, contextual universal embeddings (CUE), trains LMs on one type of contextual data and adapts to novel context types. The model consists of a pretrained neural sentence LM, a BERT-based contextual encoder, and a masked transformer decoder that estimates LM probabilities using sentence-internal and contextual evidence. When contextually annotated data is unavailable, our model learns to combine contextual and sentence-internal information using noisy oracle unigram embeddings as a proxy. Real context data can be introduced later and used to adapt a small number of parameters that map contextual data into the decoder’s embedding space. We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context. Bootstrapping a contextual LM with only a subset of the metadata during training retains 85% of the achievable gain. Training the model initially with proxy context retains 67% of the perplexity gain after adapting to real context. Furthermore, we can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model. Overall, we obtain a modular framework that allows incremental, scalable training of context-enhanced LMs.


Introduction
Language models (LMs) estimate the prior probabilities of token sequences and are key probabilistic modeling components in a variety of applications, such as speech recognition, machine translation, or software keyboards. When modeling linguistic token sequences, typical LMs model one sentence or utterance at a time, reflecting the fact that the strongest predictors of words are syntactic constraints and semantic associations within the sentence. However, it has long been recognized that context beyond the sentence has substantial influence on the word probabilities within a sentence. Context literally means the surrounding text (or preceding text, when predicting words in temporal order), but can also refer to any extra-linguistic information, such as metadata (e.g., authorship, time, location) or associated other modalities (e.g., visual cues associated with a spoken utterance).
There is a large literature on leveraging such contextual information for language modeling, some of which we review below (Section 2). However, including context in language modeling presents major challenges for operational settings, especially when LMs need to be trained and deployed at scale:
• Context data is hard to come by. Many language corpora have no or very limited metadata, or contain unordered sentences that do not provide sequential context.
• Context types are specific to a given source. A newspaper corpus has metadata that is very different from spoken language data.
• Use of context renders models context-specific, and therefore less universally applicable. With each type of context, a new model, or even model architecture, is required.
• Context modeling requires more parameters, more compute, and more training data.
All these difficulties lead to context being used sparingly in most practical settings, and only when it yields substantial benefits (such as in using a user's personal contact list in voice dialing).
In this paper, we propose a modular modeling framework for contextual language models, called contextual universal embeddings (CUE). The fundamental idea is to separate the modeling of (1) sentence-internal LM, (2) context embedding and (3) combination of sentence-internal and contextual information each into their own modules. First of all, we show that this architecture is an effective way to bring context to bear on the LM task, achieving 25% relative perplexity reduction over a sentence-internal model, on a corpus of newspaper articles with rich metadata. More importantly, each module can be trained separately, as opposed to jointly with the other modules. Through experimentation we show that, for the practically important use-cases, training modules separately or incrementally preserves most of the achievable gain from contextual information.
Specifically, we can replace one type of context with another, while only adapting the context encoders to the new context, and retain 85% of the best-case perplexity gain. Perhaps more surprisingly, we can train the decoder that combines context and sentence-internal information without any actual context, instead using noisy oracle unigram embeddings as a proxy. This recovers 67% of the best-case gain after adapting to real context. (Adapting the context encoders affects far fewer parameters, and takes much less data, than training the model overall.) Finally, we show that context encoders can be frozen and a whole different sentence-LM architecture swapped into the model ensemble. After adapting only the combiner-decoder, we obtain perplexity gains close to the optimum that would have been achieved by joint training of combiner and context embedding.

Prior Work
Longer text history is the most commonly used context in language models (LMs) (Mikolov and Zweig, 2012; Jaech and Ostendorf, 2018a; Ji et al., 2015; Lin et al., 2015). A naive way to bias an LM on text history is to ignore the sentence boundaries and train the contextual LM as a standard neural LM (Ji et al., 2015). However, recurrent neural networks suffer training difficulties on longer sequences (Bengio et al., 1994), while transformer-style models are effective at incorporating this extra information (Dai et al., 2019; Brown et al., 2020).
Another approach is to summarize context into a single context embedding using a separate model. For example, Mikolov and Zweig (2012) and Le and Mikolov (2014) use topic information extracted from the context. Mikolov and Zweig (2012) use a pretrained Latent Dirichlet Allocation (LDA) model, while Le and Mikolov (2014) use paragraph embeddings learned during LM training. Wang and Cho (2016), on the other hand, use a bag-of-words representation of the whole text or of individual sentences in the context to build the context vector. Roh et al. (2020) and Lin et al. (2015) further extend sentence-based contextual models by using hierarchical embedding techniques. This approach learns a representation of the context that is directly used as input to a neural LM.
Other sequence tasks in natural language processing (NLP) also leverage contextual information. Neural machine translation (NMT) capitalizes on the availability of previous sentences on the source and target sides when translating documents (Yun et al., 2020; Sugiyama and Yoshinaga, 2021); the approaches differ only in how the context is encoded into a representation optimized for NMT. Automatic speech recognition (ASR) and conversational dialog systems also use contextual information, including recent advances in shallow or deep fusion for end-to-end neural architectures (Zhao et al., 2019; Williams et al., 2018; Kim and Metze, 2018; Munkhdalai et al., 2021; Jain et al., 2020). Recent papers have also considered biasing LMs with context beyond the previous sentence, incorporating additional signals such as date-time, geolocation or gender (Ma et al., 2018; Diehl Martinez et al., 2021), or application metadata like dialog act or intent (Masumura et al., 2019; Shenoy et al., 2021; Liu and Lane, 2017). Other sources of context used to bias LMs are personalized content (Jaech and Ostendorf, 2018b; Fiorini and Lu, 2018); conversational turn-taking (Xiong et al., 2018); multi-modal sources (Moriya and Jones, 2018); and even user demographics for fashion recommendations (Denk and Peleteiro Ramallo, 2020).

Architecture
Our task is to estimate an auto-regressive language model conditioned not only on the previous words in the sentence, W = w_1, w_2, ..., w_n, but also on several contextual signals, where C = [c_1, ..., c_K] represents a set of K contextual inputs that may vary with each sentence, and where each c_k is a sequence of tokens from the same vocabulary as the target sentence.
Adapting recent work in hierarchical contextual embeddings (Yun et al., 2020), our CUE architecture has three components (see Figure 1):
• An auto-regressive transformer sentence encoder conditioned on within-sentence history.
• A transformer context encoder with BERT-based embeddings that combines multiple context signals into one CUE vector.
• An auto-regressive transformer decoder to predict the current word based on the context and within-sentence embeddings.
Our goal is to train a separate context encoder that can be updated without modifying the sentence encoder or decoder, and whose inference can run independently of the other modules. This is important operationally: context embeddings can be precomputed and updated incrementally without requiring downstream components to change.

Context Encoder
We present all the contextual signals to the context encoder as strings. Non-textual signals, like datetime, are converted into English, such as "Wednesday 29 May 1985". Similarly, we represent any categorical or symbolic context by its English text string.
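As a concrete illustration, a timestamp can be rendered into such an English string with Python's standard library (the exact surface form is a convention chosen here for illustration, not one prescribed by the paper):

```python
from datetime import date

def date_to_context_string(d: date) -> str:
    """Render a date as an English context string, e.g. 'Wednesday 29 May 1985'."""
    return f"{d:%A} {d.day} {d:%B} {d:%Y}"

print(date_to_context_string(date(1985, 5, 29)))  # Wednesday 29 May 1985
```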
The encoder projects the set C of K context types into one compact embedding. These may consist of the previous K sentences, K different metadata types such as datetime, or a mix of both.
We then encode each context string c_k with DistilBERT (Sanh et al., 2019) and represent each context by its CLS embedding to generate the intermediate representation

g_k = DistilBERT_CLS(c_k).    (2)

We do not fine-tune DistilBERT since empirically it gave negligible gains on our experimental corpora. The set of intermediate representations G = [g_1; ...; g_K] is then passed through transformer blocks to learn dependencies between the context types and to generate the self-attended embeddings

[e_1; ...; e_K] = Transformer(G).    (3)

The contextual encoder is invariant to the ordering of context types, since we treat context as a "bag of sequences" and do not add positional embeddings to the CLS embeddings. Additionally, we do not use query values from the within-sentence encoder, so as to preserve the modularity of our architecture; our goal is to use one context encoder with multiple sentence encoders or modeling tasks. The empirical gain from conditioning the context attention on the history at each word position (thus giving a different context vector for each token) was small. Finally, the per-context embeddings e_k are averaged to produce our compact representation,

e_cue = (1/K) Σ_{k=1}^{K} e_k.    (4)
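A minimal PyTorch sketch of this encoder follows. The layer sizes and attribute names are illustrative assumptions, and the frozen DistilBERT CLS embeddings are stood in by random tensors:

```python
import torch
import torch.nn as nn

class CUEContextEncoder(nn.Module):
    """Sketch of the CUE context encoder: per-context CLS embeddings are
    projected, self-attended (no positional encoding, so context order is
    irrelevant), and averaged into a single CUE vector."""
    def __init__(self, bert_dim=768, d_model=512, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(bert_dim, d_model)   # DistilBERT dim -> model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.attend = nn.TransformerEncoder(layer, n_layers)

    def forward(self, g):              # g: (batch, K, bert_dim) CLS embeddings
        e = self.attend(self.proj(g))  # (batch, K, d_model), order-invariant
        return e.mean(dim=1)           # (batch, d_model) CUE vector

enc = CUEContextEncoder()
g = torch.randn(3, 11, 768)  # stand-in for 11 frozen DistilBERT CLS embeddings
cue = enc(g)
print(cue.shape)             # torch.Size([3, 512])
```

Because no positional encodings are added, self-attention is permutation-equivariant and the final mean is permutation-invariant, matching the "bag of sequences" treatment above.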

Sentence Encoder
The sentence encoder is a familiar auto-regressive masked language model with a transformer encoder and a final softmax layer to generate a distribution over the vocabulary (Vaswani et al., 2017). We used six layers of 512 dimensions, each with 4 attention heads, and a standard language modeling task to fit the parameters; no context was used to train this module. In our experiments, the sentence-encoder parameters are frozen and never fine-tuned when biasing the decoder with context. We assume that the sentence encoder was trained on a very large general text corpus. It uses the same DistilBERT tokenizer as the context encoder, but does not use DistilBERT word embeddings, since our model is causally auto-regressive.

Decoder
The decoder is a masked transformer decoder as described in Vaswani et al. (2017), with six layers of 512 dimensions and 4 heads for multi-headed attention. The sentence-internal embeddings (before the softmax layer) are passed as the shifted outputs to the decoder, along with the contextual CUE vector e_cue as input to the multi-headed attention module in the decoder.
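Schematically, this combiner-decoder can be sketched with PyTorch's built-in transformer decoder, treating the CUE vector as a single-position memory for cross-attention. This is a simplified stand-in for the paper's exact wiring, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class CUEDecoder(nn.Module):
    """Sketch: causal self-attention over sentence-internal embeddings,
    cross-attention over the CUE vector (a length-1 'memory')."""
    def __init__(self, d_model=512, n_heads=4, n_layers=6, vocab=28996):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.logits = nn.Linear(d_model, vocab)

    def forward(self, sent_emb, cue):        # sent_emb: (B, T, d), cue: (B, d)
        T = sent_emb.size(1)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(sent_emb, cue.unsqueeze(1), tgt_mask=causal)
        return self.logits(h)                 # (B, T, vocab)

dec = CUEDecoder()
out = dec(torch.randn(2, 7, 512), torch.randn(2, 512))
print(out.shape)                              # torch.Size([2, 7, 28996])
```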

Adapting to Evolving Context
We now no longer assume that the set of context types is static between training and test. For example, an API providing context may be retired, or business rules improving customer privacy may remove geographic information. The set of contexts may evolve over the life-cycle of our CUE encoder, so we now introduce an adaptation step. Ideally, we would jointly fine-tune the entire model architecture (context encoder, sentence encoder and decoder) on annotated data that contains the new context types. However, this creates an operational burden, since different downstream decoders that use context embeddings would each need retraining. Our goal is to adapt the CUE context encoder while leaving the decoder parameters frozen. This minimizes the number of parameters to be retrained and simplifies model deployment by factoring the context encoder from the decoder.
We break the training process into two phases: Training constructs the initial set of model parameters and is not constrained by operational needs. Adaptation happens at some later point in time after the set of context types changes. Section 4.1 considers the scenario where new context types are added to or replace the initial training types. Section 4.2 assumes no context is available during model training, only at adaptation time.

Adapting with annotated data
Our adaptation strategy is to fine-tune only the context encoder, leaving the other parameters unchanged. Since the context encoder consumes sequences of text, our approach benefits from DistilBERT projecting sentences into a shared embedding space through the CLS token prepended to the beginning of the sentence. (Out of scope for this paper is missing context at inference time; we assume the same set of contexts at adaptation and testing time.) We fine-tune the context transformer that operates on the per-context DistilBERT embeddings before averaging (see Figure 1). This component has 6.6M parameters, roughly 5% of the total number of parameters (see Table 1). Given an adaptation corpus of sentences paired with new context types, we take a forward pass for each batch and then backpropagate through the decoder to the contextual encoder. This approach handles any arrangement of new context. Section 6.3 presents results showing how adaptation improves over zero-shot use of a static model.

Figure 2: We add noise to an embedding of the target sentence unigram distribution as a proxy for the decoder to learn to attend to context as yet unknown during training. Modules in gray are frozen during decoder training.
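The freeze-and-adapt recipe might look like the following sketch, where the submodule names (.sentence_encoder, .decoder, .context_encoder with its "proj" and "attend" parts) are hypothetical; the paper does not prescribe attribute names:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full model; the real submodules are transformers.
class TinyCUE(nn.Module):
    def __init__(self):
        super().__init__()
        self.sentence_encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)
        self.context_encoder = nn.ModuleDict(
            {"proj": nn.Linear(8, 8), "attend": nn.Linear(8, 8)}
        )

def prepare_for_adaptation(model):
    """Freeze everything, then unfreeze only the context encoder's
    transformer and projections (~5% of parameters in the paper);
    DistilBERT itself would stay frozen."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.context_encoder.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)

m = TinyCUE()
opt = prepare_for_adaptation(m)
print(m.decoder.weight.requires_grad, m.context_encoder["proj"].weight.requires_grad)
# False True
```

Gradients still flow *through* the frozen decoder during backpropagation; only the unfrozen context-encoder weights receive updates.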

Proxy embeddings
We now consider the scenario where we have no sentences paired with context during training, but want to bias our architecture on context at adaptation time. Since the decoder parameters are frozen during adaptation, we must prime it during training to pay attention to a context embedding, even though we lack the context to generate such an embedding. We tackle this problem by hypothesizing that context plays a role similar to a topic model: it mostly affects the unigram distribution, with small higher-order effects. Thus, we generate proxy embeddings from an oracle encoding the unigrams in the target sentences, described below.

Generate unigram embeddings
We first transform each sentence W = w_1, ..., w_n in our training corpus D into an empirical unigram distribution ("bag of words") over the vocabulary V,

P̂(W)(v) = (1/n) Σ_{i=1}^{n} 1[w_i = v],  v ∈ V.    (5)

Next, a feed-forward auto-encoder F_Θ reconstructs P̂(W) through a low-dimensional hidden layer, fitted by minimizing the Kullback-Leibler (KL) divergence reconstruction loss

L(Θ) = KL( P̂(W) || F_Θ(P̂(W)) ).    (6)

The layers were 28996×128×16×128×28996 with ReLU non-linearities and a final softmax layer to generate a distribution over the vocabulary. We swept multiple architecture sizes and saw no gain from more parameters. Reconstruction loss on the test set improved from 5.59 to 1.94 after ten epochs.
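A scaled-down sketch of this auto-encoder (an illustrative vocabulary of 1,000 instead of 28,996; the 16-d bottleneck is what later serves as the proxy embedding):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 1000  # illustrative vocabulary size; the paper uses 28,996

class UnigramAutoEncoder(nn.Module):
    """Feed-forward auto-encoder (V x 128 x 16 x 128 x V in the paper) that
    reconstructs a sentence's unigram distribution under a KL loss."""
    def __init__(self, vocab=V):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(vocab, 128), nn.ReLU(),
                                    nn.Linear(128, 16), nn.ReLU())
        self.decode = nn.Sequential(nn.Linear(16, 128), nn.ReLU(),
                                    nn.Linear(128, vocab))

    def forward(self, p):              # p: (B, vocab) unigram distribution
        z = self.encode(p)             # 16-d bottleneck = proxy embedding
        return z, F.log_softmax(self.decode(z), dim=-1)

def unigram_distribution(token_ids, vocab=V):
    p = torch.zeros(vocab)
    for t in token_ids:
        p[t] += 1.0
    return p / p.sum()

ae = UnigramAutoEncoder()
p = unigram_distribution([3, 5, 5, 9]).unsqueeze(0)
z, log_q = ae(p)
loss = F.kl_div(log_q, p, reduction="batchmean")  # KL(p || q)
print(z.shape, loss.item() >= 0)
```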

Train decoder with proxy embeddings
We then replace the context encoder with this auto-encoder, freeze the sentence encoder, and fit the parameters of the decoder on training data that have no context annotations (see Figure 2). In place of the context embedding, we construct a proxy embedding â by adding Gaussian noise ε ~ N(0, σ²I) to the unigram embedding a of the entire sentence and re-normalizing,

â = (a + ε) / ||a + ε||₂.    (7)

As we increase σ, the information content in â decreases, calibrating the information content of the proxy embeddings to match the expected strength of the actual context. Section 6.3 details the importance of this hyperparameter. We then project this low-dimensional embedding to the target contextual embedding dimension (512 in our experiments) through a linear projection and pass it as input to the decoder.
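The proxy construction itself is a few lines. In this sketch the noise scale sigma and the 16-to-512 projection follow the description above, with all dimensions illustrative:

```python
import torch

def proxy_embedding(a, sigma):
    """Add isotropic Gaussian noise to the sentence's unigram embedding and
    re-normalize to unit length; sigma calibrates how informative the proxy is."""
    noisy = a + sigma * torch.randn_like(a)
    return noisy / noisy.norm(p=2, dim=-1, keepdim=True)

a = torch.randn(4, 16)          # 16-d oracle unigram embeddings (illustrative)
ahat = proxy_embedding(a, sigma=0.5)
print(ahat.norm(dim=-1))        # unit norms (up to float error)

# A linear layer then maps the 16-d proxy into the 512-d context space.
to_context = torch.nn.Linear(16, 512)
cue_like = to_context(ahat)     # shape (4, 512), fed to the decoder
```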

Adapting the context encoder
Once annotated sentences with context are available for adaptation, we train only the context encoder to project available context into an embedding space tuned to the decoder. The decoder was "primed" to attend to an external embedding and the sentence-internal embeddings. We freeze the decoder and sentence encoder weights, backpropagate, and update only the weights of the transformer in the context encoder and the linear projections that scale from DistilBERT embeddings to the context embedding dimension. As mentioned above, our encoder is agnostic to the ordering of the context types and transforms text into an embedding through DistilBERT. Section 6 demonstrates that this approach successfully adapts to unseen context data.

Table 2: NYTimes corpora used in this work. We randomly shuffle all articles before partitioning and use 20% of the entire corpus to reduce experiment turnaround time.

Corpus
We used the New York Times Annotated Corpus (Sandhaus, 2008) released through Linguistic Data Consortium (catalog number LDC2008T19) containing over 1.8M English articles spanning 1987 to 2007. This corpus includes a rich collection of contextual annotations for each article, ideal for evaluating our CUE framework. Each article contains up to 47 different metadata types that were labeled by humans (author, title, desk) or algorithmically (locations, topic). We down-selected from 47 to 11 distinct metadata signals after removing redundant or uninformative context. All context types were character sequences and include previous sentence, title, author, entities present in the article, section descriptors, date, and topic descriptors (see Appendix A for details). Articles averaged 32 sentences in length and average sentence length (after tokenization) was 26. We trained the sentence encoder on a large subset of articles; used separate training and adaptation corpora and separate validation and test sets (Table 2).
Experimental Results

Hyper-parameters
Sentences were tokenized first with spaCy (Honnibal and Montani, 2017) and then into word pieces using the DistilBERT tokenizer. We evaluated model performance by computing perplexity (PPL) on the held-out test set. We trained all models for ten epochs using the AdamW optimizer (Loshchilov and Hutter, 2017) with gradient clipping of 0.95. We parallelized batches on 8 V100 GPUs and averaged 75k tokens per second with a per-GPU batch size varying between 64 and 256 sentences depending on the architecture size. The parameters of the sentence encoder and DistilBERT are frozen for all experiments, greatly speeding up training with negligible impact on PPL.

Table 3: Reduction in PPL by adding context. We contrast a sentence-internal transformer LM with four variations of added contextual information. Article metadata (e.g., author, title) is mildly informative; the previous sentence is the most useful. Metadata improves PPL more when the previous sentence is included.

Contextual biasing
We first compare our architecture against a sentence-internal auto-regressive language model. The 6x512, 4-head transformer word language model was trained on the separate 200M-word corpus and used as the sentence encoder in our full CUE framework. As shown in Table 3, contextual signals reduce PPL by 25% for this corpus, and nearly three fourths of that gain is due to the previous sentence. Since the remaining contextual features are at the article level, they have a smaller impact on within-sentence likelihoods. (See Appendix B for a breakdown of the relative strength of each contextual type.) This 25% relative gain is the upper bound for adaptation methods, since context types are consistent between training and test, and the context encoder and decoder are trained jointly.

To evaluate the key elements of our architecture, we conducted an ablation study, disabling various components and measuring the relative degradation in perplexity, as shown in Table 4. Removing the transformer after the DistilBERT embeddings and using a simple average gives an 8% degradation. Removing the transformer decoder and instead concatenating the CUE vector with each step's hidden state before the logit layer gives a 22% degradation. Replacing DistilBERT with a randomly initialized transformer estimated on the contextual training corpus gives the biggest loss of 27%. Finally, using only the context to predict each word (a constant vector at each step) is much worse, but still 45x better than random (which would be a PPL equal to the vocabulary size of 28,996). Context is a useful prior, even though it is constant for all tokens in the sentence.

Table 4: Ablation study on architecture modules. Refer to Figure 1 for a schematic of the components.

Figure 3 captures the model's attention converging to the relative importance of each context type. The ordering of context types by attention weights is similar to a ranking by perplexity gain given in Appendix B.

Adaptation
We next evaluate our framework for interchangeability of different forms of context. We randomly partitioned the eleven context types into two sets A and B and report the average over five separate trials in Table 5. Set A is our training set and we experiment with two adaptation scenarios: B replaces A or B is added to A.
When adding additional context types (A → A+B), adapting the context encoder without retraining the decoder captures 85% of the possible gain from jointly training the context encoder and decoder on all context types (compare rows 1, 10 and 11). Starting at a word-only PPL of 36.6, adaptation to A+B reaches 28.8, versus the lower bound of 27.4. When replacing context types (A → B), adaptation also achieves 85% of the possible gain (36.6 to 30.9 versus the lower bound of 29.9 in rows 1, 6 and 7). Even without any adaptation (rows 5 and 9), our architecture generalizes to new context types (approximately 70% of the possible gain), though not as effectively as with adaptation. This is because we transform context into English text and leverage BERT embeddings as a strong initial embedding for context sequences. When we train the decoder with proxy embeddings (no real context at all) and adapt to context, the PPL is within 6% to 11% (depending on the context subset) of the lower bound of jointly training the context encoder and decoder. This approach recovers 67% of the gain from jointly training the context encoder and decoder in the two scenarios (A → B and A → A+B). We find this quite remarkable given that the decoder knows nothing of context during training; the result validates our hypothesis that context primarily encodes unigram priors.

Table 5: Adaptation results. Results are averaged over five random partitions of context types into training set A and adaptation set B. Results without std. dev. are based on a single experiment run. Adding metadata with CUE embeddings outperforms a word-only model (row 1) by 25% (row 11). CUE vectors are robust to evolving context, whether with no context in training (rows 4, 8), no adaptation (rows 5, 9), or adaptation with new annotated sentences (rows 6, 10). Contrast with the lower bound of all context available in training and adaptation (rows 7, 11).
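The fraction-of-gain figures quoted above follow directly from the reported perplexities:

```python
def gain_retained(baseline, adapted, lower_bound):
    """Fraction of the best-case PPL gain captured by adaptation."""
    return (baseline - adapted) / (baseline - lower_bound)

# A -> A+B: adapt the context encoder only, vs. jointly training everything.
print(round(gain_retained(36.6, 28.8, 27.4), 2))  # 0.85
# A -> B: replacing the context types.
print(round(gain_retained(36.6, 30.9, 29.9), 2))  # 0.85
```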
We tuned the strength of the proxy embedding by sweeping the variance of Gaussian noise added. The sweet spot is where the information content in the proxy embedding is close to the actual context, as shown in Figure 4. This intuitive result provides a sensible recipe for setting this hyperparameter in a production setting.

Different sentence encoders
Our CUE architecture factors the context encoder from the sentence encoder and decoder. This approach generates one embedding that can be used with multiple decoder and sentence encoder pairs.

Figure 4: We sweep the amount of noise added to the oracle unigram vector on the x-axis. When training and testing on only the unigram vector (blue line), the unigram vector is a powerful oracle without noise, but becomes random as the variance increases. During adaptation (orange line), we discard the unigram embeddings, freeze the decoder parameters, and retrain the context encoder (5% of the parameters). The amount of embedding noise is roughly optimal when the proxy embedding is as informative as actual context (where the two lines intersect).
To evaluate the generalizability of our CUE framework, we trained a 4x512 LSTM sentence encoder and froze its model parameters for the remaining experiments. We then trained a new decoder using the LSTM sentence encoder and evaluated two different context encoders: (1) randomly initialized and jointly trained with the decoder, or (2) pretrained jointly with the old, transformer-based sentence encoder and kept frozen. Table 6 shows that CUE vectors trained with one sentence-LM architecture are useful to the other, with relative degradations when swapping of 7% and 1% between LSTM and transformer sentence encoders, respectively. These results suggest that our CUE framework can factor the training of the context encoder and decoder and generalize to multiple decoder architectures. Such situations frequently occur in operational settings, for example between first- and second-pass LMs in speech recognition, or when parameter sizes are compressed due to memory and latency constraints.

Discussion
We analyzed whether a sentence's context behaves like a cache model, since it contains textual data from the previous sentence, title, and other contexts. To better understand this effect, we divided test-data tokens into two bins: those that appeared in the text of the sentence's context and those that did not. We measured the relative gain in log likelihood when conditioning the LM on context versus not. 30% of the tokens appeared in the context (cache), and their relative gain was 74%: there clearly is a strong benefit for recurring tokens, and the CUE encoders capture this effect. The 70% of tokens that do not occur in the cache improved their log likelihood by 26% relative. So the cache effect does not explain the entire benefit of CUE vectors; correlations among different token types are captured as well. The context types that most often contributed tokens to the sentence were previous sentence (23%), title (9%) and person (6%).
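A sketch of this binning analysis follows. The tokens, context contents, and per-token log-probabilities are invented toy values, and "relative gain" is interpreted here as summed log-likelihood gain normalized by the magnitude of the baseline log likelihood (one plausible reading of the measurement above):

```python
def cache_analysis(tokens, context_tokens, logp_ctx, logp_base):
    """Split test tokens by whether they occur in the sentence's context and
    report the relative log-likelihood gain from conditioning, per bin."""
    ctx = set(context_tokens)
    stats = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [gain, baseline] per bin
    for tok, lc, lb in zip(tokens, logp_ctx, logp_base):
        in_cache = tok in ctx
        stats[in_cache][0] += lc - lb
        stats[in_cache][1] += lb
    return {k: g / abs(b) for k, (g, b) in stats.items() if b != 0.0}

# Toy example with hypothetical per-token log-probabilities.
tokens = ["mets", "win", "the", "series"]
context = ["mets", "yankees", "baseball"]
logp_ctx = [-2.0, -3.0, -1.0, -4.0]   # with context
logp_base = [-5.0, -3.5, -1.0, -4.5]  # sentence-internal only
print(cache_analysis(tokens, context, logp_ctx, logp_base))
```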
To verify that the empirical improvements from the previous sections are semantically plausible, we analyzed the context embeddings of the first sentence of 5000 heldout articles. These embeddings do not contain information from the previous sentence and thus represent the entire article's metadata. Figure 6 projects these embeddings down to two dimensions with t-SNE. We then clustered the vectors with k-means and aggregated word counts for all articles within a cluster.
Finally, we display the five most salient words (computed with TF-IDF) from the context and, separately, from the article text. Even though the articles were clustered based only on context, the groupings of article text are semantically meaningful, with clear clusters related to newspaper sections such as corrections, marriage announcements, sports and other news related topics. Our context embedding is preserving semantic information.
One limitation of the proxy embedding approach is that it may not extend to other NLP tasks, like named entity tagging. Since proxy embeddings are derived from the unigram embedding, they directly encode the targets of the language modeling task. This may not prove useful for higher-order annotations, and further work should look into a multi-task proxy embedding that directly optimizes an "interface" embedding space instead of a unigram distribution.

Conclusions
We introduced the CUE framework to factor context encoding and next word prediction of contextaware neural language models. Unlike previous work, we do not assume that the set of context signals is constant between training and test. We optimize the model architecture to reduce the operational burden of managing and retraining of large neural LMs over their life cycle.
Our approach is robust to changing context types; by adapting only 5% of the parameters, we recover 85% of the possible gain from jointly training all components. Furthermore, we introduce proxy embeddings to pretrain a decoder to be attuned to external context embeddings even when those are not known at training time. This approach is 67% as good as jointly training with all context.
In future work, we would like to handle missing context at inference time through data imputation or dropout approaches. Furthermore, we plan to extend the proxy embedding approach such that the context encoders can be trained fully independently of the decoder.

C Appendix: Qualitative Visualization

Figure 6: t-SNE plot of context embeddings. We cluster the first-sentence embeddings of 5000 articles and project the 512-d context vectors to two dimensions with t-SNE. We group context vectors into clusters with k-means and compute TF-IDF scores separately for context (green) and sentence (blue) words, showing the top 5 of each. Notice how the set of five green words coheres with the five blue words, indicating that the CUE embeddings project context and metadata to a similar space as the article contents. The clustering recovers meaningful news topics, such as company earnings reports, obituaries, sports, books and art.