HistAlign: Improving Context Dependency in Language Generation by Aligning with History

Language models (LMs) can generate hallucinations and incoherent outputs, which highlights their weak context dependency. Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency and have shown remarkable performance in diverse language generation tasks. However, we find that even with training, the performance gain stemming from the cache component of current cache-LMs is suboptimal due to the misalignment between the current hidden states and those stored in the memory. In this work, we present HistAlign, a new training approach to ensure good cache alignment such that the model receives useful signals from the history. We first prove our concept on a simple and synthetic task where the memory is essential for correct predictions, and we show that the cache component of HistAlign is better aligned and improves overall performance. Next, we evaluate HistAlign on diverse downstream language generation tasks, including prompt continuation, abstractive summarization, and data-to-text. We demonstrate that HistAlign improves text coherence and faithfulness in open-ended and conditional generation settings respectively. HistAlign is also generalizable across different model families, showcasing its strength in improving context dependency of LMs in diverse scenarios. Our code is publicly available at https://github.com/meetdavidwan/histalign


Introduction
Language modeling (LM), or language generation, requires decent context dependency. For both open-ended and conditional generation tasks, we want the model generation to be consistent with its previous generation or the input context. However, incoherence and hallucination problems are pervasive in current model generations (Holtzman et al., 2020).
Figure 1 (caption): [...] Chang and McCallum (2022). Our HISTALIGN is able to assign high probabilities to both king and woman, and thus is able to tune down the weight of the hallucinated token queen from the softmax probability. Current cache language models (baseline) give high probabilities to irrelevant tokens in the cache and thus are at risk of producing hallucinated or incoherent tokens.
The cache language model (Cache-LM; Grave et al., 2017b) is a simple yet effective method to improve context dependency by equipping an LM with an additional memory of recent history (local context) and enabling it to directly "copy" from the history. Such models showed considerable improvement in language modeling and downstream generation tasks (Merity et al., 2017; See et al., 2017). However, since the introduction of Transformers (Vaswani et al., 2017), local memory has been less used due to the powerful self-attention mechanism, and more work has focused on leveraging long-term or external memory (Khandelwal et al., 2020; Yogatama et al., 2021). Nonetheless, Zhong et al. (2022) showed that using local memory on top of a Transformer is still beneficial.
In this paper, we focus on applying a local cache to Transformer-based LMs and show that better alignment of the cache component leads to stronger gains. First, we show that cache-LM theoretically breaks the softmax bottleneck (Yang et al., 2018) that limits the capacity of any parametric LM to model highly context-dependent natural language. Then, we find that, in current cache-LMs, the signals provided by the memory component are minor, even when using the cache component during training (Zhong et al., 2022). We hypothesize that the main bottleneck comes from the misalignment of the current hidden states and those in the memory, because of which more relevant memories are not given higher weights than less relevant ones. We demonstrate this problem through a synthetic task: Ambiguous Template (Chang and McCallum, 2022), an example of which is shown in Figure 1. When asking the model to predict the next word given the context "After debating whether to bow to the woman or the king, the jester decided to bow to the __," current cache-LM does not give the highest probabilities to the desired words king and woman. Instead, we find that irrelevant words, such as to and jester, have high cache probabilities. When combining these probabilities with the original softmax, the desired words cannot be ranked as top tokens. We find that this problem exists in pretrained LMs of various sizes, fine-tuned models, as well as cache-augmented models.
Next, we address this misalignment issue by proposing a new fine-tuning scheme, HISTALIGN, in which we augment the LM training objective with a contrastive loss to encourage the model to align the current hidden states with those in the history. As shown in Figure 1, our cache component gives higher probabilities for king and woman than other less relevant words in the cache. Unlike the typical contrastive loss that treats all negative examples equally, we propose to learn a ranking of negative tokens, i.e., more semantically similar tokens are ranked higher. As shown in Figure 2, when we align the space for the token housing, we want words such as accommodations to be closer than less relevant words like children. Hence, the cache can also be useful even when the exact target word is not present in the history. We demonstrate the stronger cache performance of HISTALIGN through the synthetic Ambiguous Template task and showcase its strength in improving coherence for open-ended prompt continuation and faithfulness for abstractive summarization and data-to-text.
To summarize, our contributions are as follows:
• We discuss why cache-LM with local memory can improve context dependency through a softmax bottleneck lens.
• We show the misalignment problem present in current cache language models and their training strategy.
• We propose a new training method, HISTALIGN, based on order-informed contrastive learning, which alleviates the misalignment problem and makes better use of memories.
• We demonstrate that HISTALIGN improves the coherence of open-ended generation as well as the faithfulness of conditional generation, and it works across different model families and adds little computational overhead.

Related Work
Cache-LM and Pointer Network. Adding a cache component to a language model (LM) was first introduced for speech recognition (Kuhn and De Mori, 1990). Grave et al. (2017c) extended this idea to RNN-based neural LMs, which they call neural cache-LM. Cache-LM predicts the next token by combining the RNN model's outputs with the similarities between the cache and the current hidden state. The cache saves tuples of hidden state and next token prediction, i.e., $(h_i, x_{i+1})$, from recent history (see Section 3.2). Essentially, the cache component enables the model to copy tokens from the history. Similar to cache-LM, a pointer network (Vinyals et al., 2015; Merity et al., 2017) also combines generating and copying of tokens but uses $h_i$ as a representation of $x_i$ (instead of $x_{i+1}$). This means that a pointer network requires learning additional transformations between the current representation and those in the past and a gating component for interpolation (Merity et al., 2017; See et al., 2017). In contrast, cache-LM does not need extra parameters to be learned and can be applied directly at test time. It is more efficient for larger cache sizes (i.e., extending cache-LM to long-term and external memory), and has been shown to perform better than the pointer network (Grave et al., 2017b; Zhong et al., 2022). While cache-LM can be directly applied at test time, a recent work (Zhong et al., 2022) showed that it leads to more improvement when the cache is used during training as well. Nonetheless, such proposed learning objectives for cache-LMs usually only provide distant supervision to the cache component. In contrast, we introduce direct supervision to the cache, which aligns the current representation with its history.
LM with Local or External Memory. Cache-LM and the pointer network were originally proposed to only use hidden states from the local context, i.e., previous tokens in the input context. Though this technique has been proven to be helpful for language modeling and other language generation tasks (Gulcehre et al., 2016; Grave et al., 2017c; Merity et al., 2017; See et al., 2017), it has been less used since the Transformer architecture became popular, because the self-attention mechanism can attend to any token in the input context. Therefore, many works (Grave et al., 2017a; Khandelwal et al., 2020; Yogatama et al., 2021; Zhong et al., 2022; Min et al., 2022) proposed to use long-term or external memory beyond the local context by applying retrieval techniques. Though our work can be extended to the external cache setting, we focus only on incorporating local memory, and we show that local memory is still helpful on top of a Transformer because it breaks the softmax bottleneck (Yang et al., 2018) of parametric language models. A concurrent work (Chang et al., 2023) also demonstrates how a pointer network breaks the softmax bottleneck through examples and empirical results, while we discuss this in a more mathematical way in Section 4.1.

Context Dependency in Language Generation. Existing language generation models demonstrate weak context dependency. For open-ended generation tasks, Holtzman et al. (2020) pointed out that strong LMs can produce very incoherent text following an input prompt. This incoherence issue has also long been observed in the story generation literature (Rashkin et al., 2020; Alabdulkarim et al., 2021). For conditional generation tasks, for example, summarization, Cao et al. (2018) and Maynez et al. (2020) showed that around 30% and 70% of model-generated summaries contain hallucinations for two popularly used summarization datasets, respectively. Similar unfaithfulness problems have also been seen in data-to-text generation (Chen et al., 2020a), machine translation (Weng et al., 2020), etc. Though many approaches have been introduced to alleviate incoherence (Li et al., 2022a) or unfaithfulness (Cao and Wang, 2021; Wan and Bansal, 2022), in this work, we explore a simple yet general cache-LM method to increase context dependency for diverse tasks. The concurrent work (Chang et al., 2023) uses pointer-network-style architectures to improve next-word distribution and summarization factuality. They modify the softmax head by using additional context-dependent embeddings. In contrast, we simply apply the original cache-LM architecture and improve it with a novel training objective.

Language Modeling
We focus on autoregressive language modeling (LM). Here, for simplicity, we assume that the LM is decoder-only, i.e., the context of the current step is the generated tokens of previous steps. We show that the same approach can easily be generalized to encoder-decoder models in Section 4.3. Given the context $c_t = x_1, \ldots, x_{t-1}$, the probability of the next token $x_t = w$ is predicted by a softmax head:
$$P_{lm}(w \mid c_t) \propto \exp(h_t^\top e_w), \tag{1}$$
where $e_w$ is the output embedding of token $w$ and $h_t$ is the output context vector (hidden state) from the model at the $t$-th step. The model is trained by minimizing the cross-entropy loss: $l_{xe} = -\sum_t \log P_{lm}(x_t \mid c_t)$.

Cache Language Models
Cache language models augment language models with a memory component. Following Grave et al. (2017c), we consider the cache to be a list of tuples of context vector and target token, $(h_i, x_i)$. Assuming we only consider the history of the local context, the local memory of the $t$-th step is written as:
$$\mathcal{M}_{local}(c_t) = \{(h_i, x_i)\}_{1 \le i \le t-1}. \tag{2}$$
Then, the next-token prediction aggregates the logits from the softmax head and the similarities between $h_t$ and those saved in the memory:
$$P_{clm}(w \mid c_t) \propto \exp(h_t^\top e_w) + \sum_{(h_i, x_i) \in \mathcal{M}_{local}(c_t)} \mathbb{1}_{x_i = w} \exp(\mathrm{sim}(h_t, h_i)), \tag{3}$$
where $\mathrm{sim}(\cdot, \cdot)$ can be an arbitrary similarity function. Here, we follow Zhong et al. (2022) and use the scaled dot product: $\mathrm{sim}(h_1, h_2) = \frac{h_1 \cdot h_2}{\sqrt{d}}$, where $d$ is the hidden dimension size. While Grave et al. (2017c) only incorporated the cache during evaluation, TRIME (Zhong et al., 2022) showed that it brings more benefits when also incorporated during training, i.e., minimizing $l_{trime} = -\sum_t \log P_{clm}(x_t \mid c_t)$. Here, we also use the cache in both training and evaluation, but we improve the training objective by introducing direct supervision on the cache (see Section 4.2).
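The cache-LM prediction described above (softmax-head scores plus similarity scores for tokens stored in the local memory, then normalization) can be sketched as follows. This is a minimal illustration under assumptions, not the released implementation: the function name `cache_lm_probs`, the toy embeddings, and the toy hidden states are all hypothetical, and every cached token is assumed to be in the vocabulary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cache_lm_probs(h_t, output_emb, memory):
    """Combine softmax-head scores with cache scores and normalize.

    h_t:        current hidden state (list of floats)
    output_emb: dict mapping token -> output embedding e_w
    memory:     list of (h_i, x_i) tuples from the local context
    """
    d = len(h_t)
    # Softmax-head term: exp(h_t^T e_w) for every vocabulary token.
    scores = {w: math.exp(dot(h_t, e_w)) for w, e_w in output_emb.items()}
    # Cache term: add exp(sim(h_t, h_i)) to the token stored alongside h_i,
    # using the scaled dot product as the similarity.
    for h_i, x_i in memory:
        scores[x_i] += math.exp(dot(h_t, h_i) / math.sqrt(d))
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

# Toy example (made-up numbers): the softmax head alone cannot separate
# "queen" from "king"/"woman", but the cache boosts the two cached tokens.
emb = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "woman": [0.0, 1.0]}
memory = [([2.0, 2.0], "king"), ([2.0, 2.0], "woman")]
probs = cache_lm_probs([0.5, 0.5], emb, memory)
```

In this toy setup all three tokens receive the same softmax-head score, so any preference for king and woman over queen comes entirely from the cache term.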

Breaking Softmax Bottleneck
We first want to connect using local memory with the softmax bottleneck problem (Yang et al., 2018) and show that Transformer's self-attention cannot break this bottleneck, while the local cache can.
Parametric autoregressive language models (Section 3.1), including Transformer-based LMs, use a softmax function operating on context vectors (or hidden states) $H \in \mathbb{R}^{N \times d}$ and an output embedding matrix $E \in \mathbb{R}^{V \times d}$. $N$ is the number of contexts; assuming every token in the training set has a different context, $N$ is the number of tokens in the training set. $V$ is the vocabulary size, and $d$ is the hidden dimension size. Then, the next-token probabilities form a log-probability matrix $A \in \mathbb{R}^{N \times V}$ ($A_{tw} = \log P(w \mid h_t)$). Ideally, since every context is unique, the rank of $A$ should be as large as $V$ (assuming $V < N$). However, as $A$ is roughly equivalent to $HE^\top$, its rank is strictly upper bounded by the hidden size $d$ (please refer to Yang et al. (2018) for the formal proof). This low-rank problem greatly limits the LM's capacity to model highly context-dependent natural language. This can be seen in Figure 1, where queen achieves higher probability than woman. The reason for the LM's difficulty with such a bimodal distribution, as explained in Chang and McCallum (2022), is that the four words king, woman, man, queen tend to form a parallelogram in the embedding space, and if the model's hidden state wishes to be close to the output embeddings of king and woman, it will also be close to those of man and queen.
To break this bottleneck, one simple solution is to increase $d$, as we see larger models usually have better performance. Another solution proposed by Yang et al. (2018) and extended by Kanai et al. (2018), Yang et al. (2019), and Chang and McCallum (2022) is to use multiple softmax heads, i.e., a mixture of softmaxes (MoS), e.g., $P(w \mid h_t) \propto \exp(h_t^\top e_w) + \exp({h'_t}^{\top} e_w)$, where $h'_t$ is a different context vector. However, adding softmax heads is fairly computationally expensive. Comparing MoS to Eq. 3, we can see that adding $\exp(h_t^\top e_w)$ and $\exp(\mathrm{sim}(h_t, h_i))$ resembles MoS without adding extra softmax heads. Another way to understand this connection is that when using local memory, $A$ is roughly equivalent to $HE^\top + HH_c^\top$, where $H_c$ are the hidden states in the local context. Assuming [...]. Hence, the rank of $A$ is no longer upper bounded by $d$. Note that this connection also holds for using long-term or external memories.
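The rank argument can be checked numerically on a toy problem. The sketch below is illustrative only: it uses random hidden states, random embeddings, and a random per-context cache (all assumptions, not the paper's experimental setup), and compares the rank of the log-probability matrix with and without the cache term. In this toy, the softmax-only rank is at most $d + 1$, the extra 1 coming from the per-row normalizer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, d, cache_len = 60, 40, 4, 8   # toy sizes (hypothetical)

H = rng.normal(size=(N, d))   # one hidden state per context
E = rng.normal(size=(V, d))   # output embeddings

def log_normalize(scores):
    # Row-wise log of normalized positive scores (a log-probability matrix).
    return np.log(scores / scores.sum(axis=1, keepdims=True))

# Softmax head only: A = log softmax(H E^T).
A_softmax = log_normalize(np.exp(H @ E.T))

# Softmax head plus a context-specific local cache.
scores = np.exp(H @ E.T)
for t in range(N):
    H_c = rng.normal(size=(cache_len, d))       # cached hidden states
    x_c = rng.integers(0, V, size=cache_len)    # their target token ids
    sims = H_c @ H[t] / np.sqrt(d)              # scaled dot product
    np.add.at(scores[t], x_c, np.exp(sims))     # accumulate per cached token
A_cache = log_normalize(scores)

rank_softmax = np.linalg.matrix_rank(A_softmax, tol=1e-8)
rank_cache = np.linalg.matrix_rank(A_cache, tol=1e-8)
print(rank_softmax, rank_cache)
```

Because each context brings its own cached states and tokens, the cache contribution is not a fixed low-rank factor, which is why the second rank is no longer tied to the hidden size.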

HISTALIGN
Cache-LM combines the original softmax probabilities with the cache probabilities by aggregating the similarity scores between the current hidden state and those in the cache. To use the cache module effectively, the similarity function $\mathrm{sim}(\cdot, \cdot)$ plays an important role in Eq. 3. If the similarities between the current hidden state and less relevant memories are higher than more relevant ones, it would steer the model away from selecting the most useful information from the cache. By assigning a high probability to the correct local memories, e.g., those corresponding to king and woman in the example of Figure 1, we can ensure that when the probabilities are combined, they will be scored higher than irrelevant and hallucinated tokens. However, we find that even when directly maximizing $\log P_{clm}$ (Zhong et al., 2022), there is no guarantee that the current representations are well aligned with relevant information stored in the memory, as shown by the baseline probabilities in Figure 1 (see Section 6.1 for more details).
Hence, to deal with this misalignment, we propose a new contrastive objective that encourages higher similarities between the hidden states of similar target tokens. During training, given the current hidden state $h_t$ and the corresponding next token $x_t$, we construct a positive set $P_t$ from caches by selecting memories with the same target token:
$$P_t = \{(h_i, x_i) : 1 \le i \le t-1,\ x_i = x_t\}.$$
All other memories are taken as negative examples.
Figure 2 (caption): [...] We first get the local cache by combining the hidden states in the local context with their target tokens, and then rank them according to embedding similarity. The ranked memories are then used to train with the margin loss. This ensures that negative yet similar words (e.g., accommodations) will be closer in the vector space than irrelevant words (e.g., children).
An example is shown in step 2 of Figure 2. For predicting the token housing, we have two previous mentions of the word housing, and the other words, including flat, children, accommodations, etc., are considered as negative.
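The positive/negative split described above amounts to a filter over the cached tuples. A minimal sketch, assuming the cache is a list of (hidden state, target token) pairs; the function name `split_memories` and the one-dimensional hidden states are hypothetical placeholders:

```python
def split_memories(memory, target):
    """Split cached (h_i, x_i) tuples by whether x_i equals the current target."""
    positives = [(h, x) for h, x in memory if x == target]
    negatives = [(h, x) for h, x in memory if x != target]
    return positives, negatives

# Toy cache mirroring the housing example; hidden states are placeholders.
memory = [
    ([0.1], "housing"),
    ([0.2], "flat"),
    ([0.3], "children"),
    ([0.4], "housing"),
    ([0.5], "accommodations"),
]
pos, neg = split_memories(memory, "housing")
```

Here the two earlier mentions of housing form the positive set, and flat, children, and accommodations are the negatives that the ranking step later orders.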
In the typical contrastive loss, such as InfoNCE (van den Oord et al., 2019), all negative examples are treated equally. However, we hope to learn an ordering of the negative examples: more similar examples are ranked higher than less similar ones. In the example in Figure 2, accommodations is more similar to housing than children. This ensures that even when predicting words that do not have previous mentions in the local cache, our model can still output a reasonable alternative.
To achieve this, we construct a ranking of memories by computing the cosine similarities between the embedding of the current target word and the embeddings of words in the cache, i.e., $\mathrm{cosim}(e_t, e_i)$. After sorting tokens from the most semantically similar to the least, we use the following max-margin loss (Liu et al., 2022c):
$$l_{cont} = \sum_t \sum_i \sum_{j > i} \max\big(0,\ \mathrm{sim}(h_t, h_j) - \mathrm{sim}(h_t, h_i) + \lambda_{i,j}\big),$$
where $i$ and $j$ index memories in the ranked order, $\lambda_{i,j} = (j - i)\lambda$, and $\lambda$ is the margin tuned based on validation loss.
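The ranking step and the max-margin loss can be sketched for a single time step as follows. This is an illustrative reading under stated assumptions: memories are ranked by cosine similarity between the target embedding and each cached token's embedding, the hidden-state similarity is the scaled dot product, and the pairwise margin is $(j - i)\lambda$. The function names, the tie handling (stable sort order), and all toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def ranked_margin_loss(h_t, target_emb, memory, emb, lam):
    """Max-margin ranking loss over one step's cache.

    memory: list of (h_i, x_i); emb: dict token -> embedding.
    Memories are sorted by cosim(e_target, e_{x_i}), most similar first;
    each pair i < j contributes max(0, sim_j - sim_i + (j - i) * lam).
    """
    d = len(h_t)
    ranked = sorted(memory, key=lambda m: cosine(target_emb, emb[m[1]]),
                    reverse=True)
    sims = [dot(h_t, h_i) / math.sqrt(d) for h_i, _ in ranked]
    loss = 0.0
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            loss += max(0.0, sims[j] - sims[i] + (j - i) * lam)
    return loss

emb = {"housing": [1.0, 0.0], "accommodations": [0.9, 0.1], "children": [0.0, 1.0]}
h_t = [1.0, 0.0]
# Cache whose hidden-state similarities already follow the semantic ranking.
aligned = [([4.0, 0.0], "housing"), ([2.0, 0.0], "accommodations"), ([0.0, 1.0], "children")]
# Cache where the least relevant token is the most similar to h_t.
misaligned = [([0.0, 1.0], "housing"), ([2.0, 0.0], "accommodations"), ([4.0, 0.0], "children")]

loss_aligned = ranked_margin_loss(h_t, emb["housing"], aligned, emb, lam=0.1)
loss_misaligned = ranked_margin_loss(h_t, emb["housing"], misaligned, emb, lam=0.1)
```

The aligned cache incurs no penalty because every higher-ranked memory beats every lower-ranked one by more than its margin, while the misaligned cache is penalized on every pair.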
The final objective of HISTALIGN is a combination of the original LM cross-entropy loss $l_{xe}$ and this ranking-based contrastive loss:
$$l_{histalign} = l_{xe} + \alpha\, l_{cont},$$
where $l_{cont}$ denotes the ranking-based contrastive loss and $\alpha$ is a tunable weight of the contrastive loss. Note that at inference time, we use Eq. 3.

Extension to Encoder-Decoder Models
HISTALIGN can be easily adapted to encoder-decoder models. For conditional generation tasks, the target text is usually short; hence, coherence is not a big issue. What is more crucial is whether the target generation stays true to the input context, e.g., the input document for summarization or the input table for data-to-text. Therefore, we define the local cache to be the input tokens and their corresponding encoder hidden states, as opposed to the output tokens and decoder hidden states for decoder-only models. We then calculate the similarity between the current decoder hidden state and the encoder hidden states stored in the cache.
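For the encoder-decoder case, the only change is where the cache comes from: input tokens paired with encoder hidden states, scored against the current decoder state. A minimal sketch, assuming the scaled dot product as the similarity and that encoder and decoder states share a dimension; the function name and toy vectors are hypothetical:

```python
import math

def encoder_cache_scores(dec_h, enc_states, input_tokens):
    """Unnormalized cache scores per input token for one decoder step.

    The cache pairs each input token with its encoder hidden state;
    repeated input tokens accumulate their scores.
    """
    d = len(dec_h)
    scores = {}
    for h_i, tok in zip(enc_states, input_tokens):
        sim = sum(a * b for a, b in zip(dec_h, h_i)) / math.sqrt(d)
        scores[tok] = scores.get(tok, 0.0) + math.exp(sim)
    return scores

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
input_tokens = ["table", "row", "table"]
scores = encoder_cache_scores([1.0, 0.0], enc_states, input_tokens)
```

These scores would then be added to the softmax-head scores before normalization, in the same way as for the decoder-only cache.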

Experimental Setup
Here, we describe the tasks and the experimental setups. Please refer to Appendix A for more details.

Tasks and Datasets
Ambiguous Template is a useful synthetic dataset collated by Chang and McCallum (2022), in which each example is generated using templates with diagonal words from semantic analogy relations in the Google (English) analogy dataset (Mikolov et al., 2013). This is a simple yet effective setting to examine whether the model can copy the correct tokens from history and not hallucinate semantically similar tokens, e.g., queen and man in the example in Figure 1. Since the target words can always be found in the context, we can also evaluate the performance with the cache component only.
Open-Ended Generation evaluates the language modeling capability by asking the model to generate a continuation given a prompt (Holtzman et al., 2020; Su et al., 2022; Li et al., 2022b). We use WritingPrompts (Fan et al., 2018), treat the first 50 tokens as the prompt, and allow the model to generate up to 256 tokens using the canonical nucleus sampling ($p = 0.95$) (Holtzman et al., 2020).
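Nucleus (top-p) sampling keeps the smallest set of most probable tokens whose cumulative probability reaches $p$, renormalizes, and samples from that set. The sketch below illustrates one decoding step on a toy distribution; the function names and probabilities are made up, and a real decoder would apply this to the model's next-token distribution at every step.

```python
import random

def nucleus_filter(probs, p=0.95):
    """Keep the smallest top-probability set with total mass >= p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    return {tok: pr / mass for tok, pr in kept}

def sample(probs, rng):
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "king": 0.17, "zebra": 0.03}
filtered = nucleus_filter(probs, p=0.95)   # drops the low-probability tail
token = sample(filtered, random.Random(0))
```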
Abstractive Summarization is the task of providing an abridged version of the input document. One crucial problem is 'hallucination', where the generated summaries contain facts or entities that are wrong or not present in the document (Cao et al., 2018; Maynez et al., 2020). We evaluate on two widely used English news summarization datasets, XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015).
Data-to-Text is the task of describing structured data, where faithfulness is extremely important, as humans do not tolerate any hallucinations in cases such as describing medical reports or financial statistics (Thomson and Reiter, 2020). We evaluate on LogicNLG (Chen et al., 2020a).

Systems
We use GPT2-small and GPT2-large (Radford et al., 2019) for Ambiguous Template and prompt continuation, and we use BART-large (Lewis et al., 2020) for both summarization and data-to-text. For all tasks, we fine-tune pre-trained LMs. The first baseline we compare to is fine-tuning with the original cross-entropy loss ($l_{xe}$ in Section 3.1), which is denoted by the original model name in our result tables. We also compare to the most recent cache-LM learning objective, TRIME (Zhong et al., 2022) ($l_{trime}$ in Section 3.2).

Evaluations
Ambiguous Template. As a proof-of-concept experiment, we evaluate under both a full setting, using the combined probability in Eq. 3, and a cache-only setting, using only the cache similarity scores to predict the next token. We evaluate the performance via the accuracy of having the two diagonal words within the top-$k$ predictions (Acc@k), where $k \in \{2, 5, 10, 25\}$. Ideally, we want to see 100% accuracy with $k = 2$, which indicates that the two diagonal words are the top 2 choices. Note that when only using the cache, a $k$ value of 50 would achieve perfect accuracy, as it would include the entire local history. In addition, we want to empirically verify that cache-LM with local memory can break the softmax bottleneck. To this end, we calculate the rank of the log-probability matrix $A \in \mathbb{R}^{N \times V}$ (Section 4.1) using 500 examples (concretely, $N = 4750$ and $V = 50257$ for GPT-2 based models) under the full setting.
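Acc@k as described above counts an example as correct only if both diagonal words appear among the top-$k$ predictions. A minimal sketch, assuming each example provides a ranked list of predicted tokens and a pair of gold words; the function name and toy predictions are hypothetical:

```python
def acc_at_k(ranked_predictions, gold_pairs, k):
    """Fraction of examples whose gold words all appear in the top-k predictions."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_pairs)
        if set(gold) <= set(ranked[:k])
    )
    return hits / len(gold_pairs)

preds = [
    ["king", "woman", "queen", "man"],   # both gold words ranked first
    ["queen", "king", "to", "woman"],    # "woman" only appears at rank 4
]
gold = [("king", "woman"), ("king", "woman")]
acc2 = acc_at_k(preds, gold, 2)
acc5 = acc_at_k(preds, gold, 5)
```

In this toy case only the first example passes at $k = 2$, while both pass at $k = 5$.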
Open-Ended Generation. We mainly evaluate the coherence of model-generated continuations. Following Su et al. (2022), coherence is approximated by the cosine similarity of the SimCSE (Gao et al., 2021) sentence embeddings of the prompt and the continuation. In addition, following previous works, we report n-gram diversity (Meister et al., 2022) and MAUVE (Pillutla et al., 2021) scores for a more general evaluation. Ideally, HISTALIGN should not harm diversity or MAUVE. We also run a human evaluation on Amazon MTurk, asking workers to compare the continuations generated by TRIME and HISTALIGN. More details can be found in Appendix B.1.
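The coherence score described above reduces to a cosine similarity once the two sentence embeddings are available. The sketch below assumes the embeddings have already been computed by a sentence encoder such as SimCSE (the encoder call itself is not shown), and the vectors are made-up placeholders.

```python
import math

def coherence(prompt_emb, continuation_emb):
    """Cosine similarity between prompt and continuation sentence embeddings."""
    dot = sum(a * b for a, b in zip(prompt_emb, continuation_emb))
    norm_p = math.sqrt(sum(a * a for a in prompt_emb))
    norm_c = math.sqrt(sum(b * b for b in continuation_emb))
    return dot / (norm_p * norm_c)

on_topic = coherence([1.0, 0.0], [1.0, 0.0])    # same direction
off_topic = coherence([1.0, 0.0], [0.0, 1.0])   # orthogonal
```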
Abstractive Summarization. We mainly evaluate the faithfulness of generated summaries by three widely used automatic metrics: FactCC (Kryscinski et al., 2020) and DAE (Goyal and Durrett, 2021), which are entailment-based metrics; and Entity Precision ($P_{\text{ENT}}$; Nan et al., 2021), which calculates the percentage of entities in the summary that are present in the document. We also report ROUGE-L (Lin, 2004) for general content selection evaluation. Similarly, we conduct a human evaluation.
Table 1 (column headers): Model; Full setting: Acc@2, Acc@5, Acc@10, Acc@25; Cache-Only setting: Acc@2, Acc@5, Acc@10, Acc@25; Full setting: Rank.
Data-to-Text. We mainly evaluate the faithfulness of model generations by NLI-Acc and SP-Acc (Chen et al., 2020a) and two more recent metrics, TAPEX-Acc and TAPAS-Acc (Liu et al., 2022a). NLI-Acc is an entailment-based metric pre-trained on the TabFact dataset (Chen et al., 2020b) using TaBERT (Yin et al., 2020), and SP-Acc first parses the sentence into a logical program and evaluates the execution accuracy. TAPEX-Acc and TAPAS-Acc are entailment-based metrics trained with TAPEX (Liu et al., 2022b) and TAPAS (Eisenschlos et al., 2020), respectively. As in previous works (Chen et al., 2020a), we report BLEU (Papineni et al., 2002) for a surface-level evaluation.
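Entity Precision, as described in the summarization paragraph above, is the share of summary entities that also occur in the source document. The sketch below is a simplified illustration: it assumes entities have already been extracted by some NER tool (not shown), matches them by case-insensitive substring search, and returns 1.0 for a summary with no entities. All three choices are assumptions rather than the metric's reference implementation.

```python
def entity_precision(summary_entities, document_text):
    """Fraction of summary entities that appear in the document text."""
    if not summary_entities:
        return 1.0  # assumption: no entities means nothing to hallucinate
    doc = document_text.lower()
    found = sum(1 for ent in summary_entities if ent.lower() in doc)
    return found / len(summary_entities)

document = "Theresa May met Angela Merkel in Berlin on Tuesday."
p_ent = entity_precision(["Theresa May", "Paris"], document)  # "Paris" is unsupported
```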

Results
We verify the strength of HISTALIGN at aligning the cache component and thus improving next-token prediction on Ambiguous Template in Section 6.1, coherence in open-ended prompt continuation in Section 6.2, and faithfulness in abstractive summarization and data-to-text in Section 6.3 and Section 6.4, respectively.

Importance of Cache on Ambiguous Template
We show the results of the Ambiguous Template in Table 1. First, it can be seen that the original GPT2 model performs poorly in the cache-only setting, especially considering Acc@2. This is expected because the original model is fine-tuned using the cross-entropy loss without the cache component involved, and thus applying the cache at test time may not be helpful. Second, though TRIME (Zhong et al., 2022) generally outperforms the original model in the full setting, its cache-only Acc@2 and Acc@5 are similar to the original model. Considering that all target words are present in the history, this result indicates that even though TRIME uses the cache during training, its cache component is still misaligned and has limited contributions to the final performance.
In contrast, HISTALIGN achieves high Acc@2 with only the cache module, substantially outperforming the original model and TRIME on both model sizes, which demonstrates the effectiveness of our contrastive loss for aligning memories better. As a result, HISTALIGN outperforms both baselines across all $k$ in the full setting. The improvement holds for both model sizes, though with smaller gaps for the large model. This observation is consistent with our discussion in Section 4.1 that a larger model with a larger hidden dimension suffers less from the softmax bottleneck, while local memory can help break this bottleneck of any parametric LM. This is also empirically verified by the rank of the log-probability matrix reported in Table 1, where we see that the rank of the original model is upper-bounded by its hidden dimension (768 for GPT2-small and 1280 for GPT2-large), and having a local cache breaks this bottleneck. Finally, we present two qualitative examples in Table 9. See detailed discussions in Appendix C.
Experiment on recent LLM. We also fine-tune the LLaMA2 7B model (Touvron et al., 2023). Interestingly, we find that LLaMA2 achieves 0% accuracy for Acc@{2,5,10} when evaluated zero-shot. After fine-tuning, the model achieves 100% accuracy without any cache. This is expected, as the task is a simple synthetic task, and the model, compared to GPT2-large, is 10x larger, and the hidden size is 3.2x larger (1280 → 4096). Thus, as mentioned in Section 4.1, the model alleviates the softmax bottleneck due to its larger hidden size.
However, we still observe two problems with LLaMA2. First, the problem of the softmax bottleneck still exists, as the rank of its output log-probability matrix $A$ is still upper-bounded by its hidden size of 4096: we find that its empirical rank is 3332. This means that it is still theoretically less expressive than highly context-dependent natural language. Second, TRIME is still not able to make good use of the cache, i.e., misalignment still exists. As shown in Table 2, TRIME achieves 0% accuracy for Acc@{2,5,10} under the cache-only setting, which shows that the issue of misalignment is even more apparent for larger language models: since the token logits perform well enough, the model does not learn to use the cache anymore. Nevertheless, as shown in the table, our training objective can enforce the use of the local cache and achieve 100% accuracy, which is consistent with our findings from smaller models.
The presence of these two issues showcases that there is still room for improvement on LM's context dependency, as HISTALIGN outperforms TRIME in making good use of cache.

Coherence in Open-Ended Generation
The results of prompt continuation can be found in Table 3. Across both sizes of the model, we observe an improvement in coherence with TRIME and a larger improvement with HISTALIGN. The effect of HISTALIGN is especially prominent for the smaller model, where coherence increases by 7.5 points compared to the original model, and 3.7 points over TRIME. This validates our hypothesis that HISTALIGN can improve the coherence of LMs. When looking at MAUVE, HISTALIGN improves by 0.8 points and 0.7 points over GPT2 and TRIME, respectively, when using small models. On the large model, while TRIME achieves the best [...]
Besides automatic evaluations, we also conduct a human evaluation, the results of which are shown in Table 4. On both fluency and coherence, human raters prefer the continuations by HISTALIGN over those by TRIME. This confirms the observation from the automatic evaluations that HISTALIGN does improve generation quality, especially coherence.

Faithfulness in Abstractive Summarization
The summarization results are shown in Table 5. TRIME improves faithfulness over the baseline on XSum, but the improvement is not clear on CNN/DM. In contrast, our HISTALIGN method greatly improves over the baseline, especially on DAE and $P_{\text{ENT}}$, which are specifically targeted towards hallucinations. Concretely, we improve FactCC by 0.91 points, DAE by 4.78 points, and $P_{\text{ENT}}$ by 3 points on the XSum dataset. HISTALIGN improves the metrics on CNN/DM as well, though to a smaller degree. This shows that allowing the model to pay specific attention to previous contexts in the input is helpful in reducing hallucinations. We note that the ROUGE-L score for HISTALIGN is lower than that of the original model. This ROUGE-faithfulness tradeoff has been observed by many previous works (Chen et al., 2021; Kryscinski et al., 2020; Wan and Bansal, 2022; Wan et al., 2023), where the reference summary inherently contains hallucinations and thus does not overlap highly with the more faithful generated summaries.
To confirm this, we conduct a human evaluation. The results are shown in Table 6. HISTALIGN achieves the best faithfulness score, which is statistically significantly better than BART. This confirms our observation from the automatic metric results in Table 5. Though there is a small drop in informativeness, the difference between the three methods has no statistical significance. This shows that the drop in automated metrics such as ROUGE-L does not necessarily mean a decrease in informativeness.

Faithfulness in Data-to-Text Generation
The results on LogicNLG are shown in Table 7. Similar to abstractive summarization, HISTALIGN can improve faithfulness on LogicNLG. Out of the four faithfulness metrics, HISTALIGN achieves the highest NLI-Acc, TAPEX-Acc, and TAPAS-Acc: HISTALIGN achieves 0.6 and 0.8 point improvements on TAPEX-Acc over BART and TRIME, respectively, and a 1.74 point improvement on TAPAS-Acc over the BART model. In the meantime, HISTALIGN obtains the best BLEU scores.

Discussion and Conclusion
In this work, we improve the context dependency of LMs by introducing a novel cache-LM training objective, HISTALIGN, which improves the existing cache-LM objective by adding an order-informed contrastive loss for the cache component. On a synthetic dataset, we show that HISTALIGN is effective at retrieving the desired memories from the cache and breaking the softmax bottleneck. Furthermore, we demonstrate the effectiveness of HISTALIGN at improving the coherence of open-ended generation and improving the faithfulness of abstractive summarization and data-to-text generation.
We want to emphasize a couple of salient points with the recent trend of pushing for larger and more powerful models. Firstly, attention mechanisms alone cannot break the softmax bottleneck, as shown in Table 2. Secondly, while increasing the model size can mitigate this bottleneck, the problem will persist unless we reach a size that truly encapsulates the complexity of human language. Cache-LM is a light alternative for breaking the softmax bottleneck theoretically and improving context dependency empirically.
Table 7: Performance on LogicNLG (data-to-text generation) evaluated by BLEU scores, NLI-Acc (NA), SP-Acc (SA), TAPEX-Acc (TA), and TAPAS-Acc (TS). HISTALIGN improves over two baselines on BLEU and three faithfulness metrics: NLI-Acc, TAPEX-Acc, and TAPAS-Acc.
[...] hope that in the future we can explore scaling up the approach on large LMs to various tasks. We believe that our method is still helpful for larger models. But as larger models suffer less from the softmax bottleneck (Section 4.1), how much it can help is an interesting problem to study in the future. Another current limitation of this work is that due to the additional hyper-parameters (the margin $\lambda$ and the weight $\alpha$ of the contrastive loss), it becomes less straightforward to incorporate our HISTALIGN objective into pre-training compared to TRIME. The training objective also uses a fixed margin between adjacent ranks (and thus assumes that tokens are equally spaced in similarity), which can be improved by dynamically adjusting the margins. Although fine-tuning is cheaper and we show effective gains using HISTALIGN in fine-tuning, how to use HISTALIGN to pre-train LMs is also an interesting direction for future work.

Ethical Considerations
As the OpenAI team pointed out, GPT-2 does not distinguish fact from fiction, so it cannot support use cases that require the generated text to be true. In addition, GPT-2 reflects the biases inherent to the data it was trained on, so it cannot be deployed unless the deployers first carry out a study of biases relevant to the intended use case. Though our HISTALIGN improves the coherence of GPT-2 generations, the above statement still holds. Similarly, although HISTALIGN improved the faithfulness of BART-large generations for abstractive summarization and data-to-text generation, such systems cannot be directly deployed and used in factuality-sensitive scenarios without further checks in place.
library (Lhoest et al., 2021) for loading the XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015) datasets. We use Huggingface's Metrics library for calculating ROUGE scores. Training the original model, TRIME, and HISTALIGN each took around 5 hours for XSum and around 4 hours for CNN/DM on 4 A6000s.

A.4 Data-to-text
We follow Liu et al. (2022a) for pre-processing the dataset, such as adding numerical pre-computation to the tables. We use a contrastive weight α = 0.5. LogicNLG (Chen et al., 2020a)

B Human Evaluation Details
For both human evaluations, we use Amazon Mechanical Turk to do the annotation. We apply the same set of requirements: the workers need to be from the United States, have more than 10,000 approved HITs, and have an approval rate greater than 98%.

B.1 Open-ended Generation
We use Amazon Mechanical Turk to annotate whether humans prefer the continuation by TRIME or by HISTALIGN. We do not include the original model, since TRIME shows better performance on the automatic metrics. We select examples where the two continuations differ in length by less than 200 characters to ensure that the lengths are similar (since shorter texts will naturally be more coherent). We collect 3 annotations per example for 100 randomly selected examples, yielding 300 annotations. We take the percentage of passages that are judged as coherent and/or fluent.
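The example selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_pairs`, the fixed random seed, and the order of operations (filter by length first, then sample) are hypothetical and are not taken from the released code.

```python
import random


def select_pairs(pairs, max_len_diff=200, n_samples=100, seed=0):
    """Filter (trime_text, histalign_text) pairs by length, then sample.

    A pair is kept only if the two continuations differ by fewer than
    `max_len_diff` characters. Up to `n_samples` of the remaining pairs
    are then drawn at random. All names and defaults are illustrative.
    """
    eligible = [
        (a, b) for a, b in pairs if abs(len(a) - len(b)) < max_len_diff
    ]
    if len(eligible) <= n_samples:
        return eligible
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(eligible, n_samples)
```

With 3 annotations per selected pair, 100 selected pairs give the 300 annotations mentioned above.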
We pay 0.5 USD per HIT, and the average time per HIT is around 2.5 minutes, which yields an hourly rate of ≥ $12. An example of the annotation page is shown in Figure 3.

B.2 Summarization
We follow the same setup as Wan and Bansal (2022), and also use a qualification test where we rate the faithfulness of the selected generated summaries. Only workers with the correct annotation can perform the actual task.
We select the most important sentences and replace the less relevant sentences with an ellipsis to reduce the workload for the workers. For each summary, we select the ten most relevant sentences from the document by cosine similarity of the sentence embeddings using SentenceTransformer 9 (Reimers and Gurevych, 2019), and then combine and show all the selected relevant sentences from each summary.
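A minimal sketch of this selection step is given below. It assumes that sentence embeddings have already been computed (for instance with a sentence-embedding model) and are passed in as plain lists of floats; the helper names `top_k_sentences` and `render_document` are hypothetical, and collapsing each run of dropped sentences into a single ellipsis is our assumption about the display format.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_sentences(summary_emb, doc_embs, k=10):
    """Indices of the k document sentences most similar to one summary."""
    order = sorted(
        range(len(doc_embs)),
        key=lambda i: cosine(summary_emb, doc_embs[i]),
        reverse=True,
    )
    return set(order[:k])


def render_document(sentences, summary_embs, doc_embs, k=10):
    """Keep the union of top-k sentences over all summaries.

    Each run of non-selected sentences is replaced by one ellipsis.
    """
    keep = set()
    for emb in summary_embs:
        keep |= top_k_sentences(emb, doc_embs, k)
    shown = []
    for i, sentence in enumerate(sentences):
        if i in keep:
            shown.append(sentence)
        elif not shown or shown[-1] != "...":
            shown.append("...")
    return " ".join(shown)
```

Taking the union over summaries means that every system's summary is judged against the same shortened document.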
Each task is annotated by three unique workers, and we take the mean of their ratings as the score for the document. The final score is the mean factuality score across all documents. The average time for each task is around 2.5 minutes and we pay 0.5 USD per task, hence an hourly rate of ≥ $12. An example of the annotation page is shown in Figure 4.
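The aggregation and the pay-rate arithmetic above can be written out as a short sketch. The function names and the example rating values are hypothetical; the rating scale itself is not specified here, so the ratings are treated simply as numbers.

```python
from statistics import mean


def document_score(worker_ratings):
    """Mean rating over the workers who annotated one document."""
    return mean(worker_ratings)


def final_score(ratings_by_document):
    """Mean of the per-document scores across all documents."""
    return mean(document_score(r) for r in ratings_by_document.values())


def hourly_rate(pay_per_task_usd, minutes_per_task):
    """Hourly pay implied by a per-task payment and average task time."""
    return pay_per_task_usd * 60 / minutes_per_task


# Illustrative ratings only (three workers per document).
example = {"doc1": [1, 1, 0], "doc2": [1, 0, 0]}
overall = final_score(example)
rate = hourly_rate(0.5, 2.5)  # 0.5 USD per task, 2.5 minutes per task
```

With 0.5 USD per task and 2.5 minutes per task, a worker completes 24 tasks per hour, which gives 12 USD per hour.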

C Qualitative Results on Ambiguous Template
We present two qualitative examples in Table 9. We see that both the original model and TRIME have difficulty in outputting the two correct words as the top two choices. This is also reflected by the cache-only results, where irrelevant words, such as "and", "I", and "the", get high probabilities. In fact, the cache similarities of the original model are similar to those of TRIME, again indicating that there is no guarantee of well-aligned memories, despite training with the cache. HISTALIGN nevertheless returns the two target words as the top two choices for both the full and cache-only settings, showing that the model benefits from the well-aligned memories through our contrastive objective.
9 We use the all-mpnet-base-v2 model.

D Sample Outputs
We show sample outputs for prompt continuations in Figure 5, summarization in Figure 6 and Figure 7, and data-to-text in Figure 8.
HISTALIGN: up and stopping things, trying to prevent every mistake that might happen. If it weren't for the fact that I was always so careful with everything, I might be a normal person who just goes about their business. A normal person like you wouldn't have the ability to go back in time to prevent yourself from ever having died in the first place. Every mistake that might have happened could have killed me then and there, and it wasn't an easy road to travel down. I was lucky enough to have only been born on July 4th, 1996. I've been trying to do the right thing all these years. I've avoided death on more than one ...

Prompt: Crown Minotaur Expedition
The first week of the expedition was slow going. The terrain was rocky and unforgiving, as if the Mother Nature had forgotten to smooth out this patch of land. The Minotaurs had the most trouble due to the
GPT2-Large: steep slopes of the mountain pass. When a Minotaurel came up the mountain pass, she would pass straight through, just passing through the pass. It was just one of many impassable ridges, that the Miatarel would face to traverse the pass. The only reason it was never impassable was due to the massive, hairy beasts, which covered half of the pass, and caused many people to leave the mountains. This, plus the numerous Minotaurel who passed through the pass would create an increasingly high attrition rate on the expedition. Still, the Miatarel had learned of the pass through the passing ...
TRIME: steep slopes of the mountain pass. When a mountain pass was at its highest, the Minotaurs were no longer able to pass into the valley below. It was just too steep. Luckily there was plenty of rock that they could use as a ramp and they would be able to crawl up with their arms, legs, and wings intact. The trip began on the third day after lunch. The trail was still rough, as many of the boulders were nearly a foot high. By now they could barely lift the weight of their bodies, much less their bags full of provisions and equipment. The road ahead of them was now covered in small puddles. This was typical of these ...
HISTALIGN: steep slopes of the mountain pass. When a mountain pass was at its highest, the Minotaurs were no longer able to pass into the valley below. It was just too steep. Luckily, they were able to set up camps and set out in the first few days. This was only going to be a matter of time until they realized where they were going. This place was far away from anything they knew of. The only light was the lanterns on their back, and the lanterns were only good for a few minutes. As for how they got there, no one is quite sure. They have not been able to find any of the equipment they carried when they left ...

Document: David Lipton, second in command at the IMF, outlined some of these risks in a speech to the National Association for Business Economics in Washington on Tuesday. "The IMF's latest reading of the global economy shows once again a weakening baseline," he said. "We are clearly at a delicate juncture." The comments come after weaker-than-expected trade figures from China showing that exports plunged by a quarter from a year ago. The IMF has already said it is likely it will downgrade its current forecast of 3.4% for global growth when it next releases its economic predictions in April. The dismal picture is one that has on-going ramifications for businesses and industries that bet on China's growth story. Read more from Karishma: Why a story about bulk shipping matters
BART: The International Monetary Fund (IMF) has warned that the global economy is at a "critical juncture".
TRIME: The International Monetary Fund (IMF) has warned that the global economy is in a "dangerous situation".
HISTALIGN: The International Monetary Fund (IMF) has warned that the global economy is in "a delicate juncture".
Document: Coventry University's Scarborough campus has been built on the town's former Weaponness Park and Ride site. About 200 students have begun courses at the site, though it is expected to eventually be home to more than 2,000 students. The building, which includes engineering and science labs, a mock law court and a library, is part of a £50m sports and education facility. Professor Craig Gaskell said: "Launching our new state-of-the-art building is a huge milestone for us and demonstrates our commitment to Scarborough and the Yorkshire coast area." A new University Technical College has been built nearby and Scarborough Athletic FC's new 2,000-seater stadium is also under construction on the site. Coventry University also has a campus near London's Liverpool Street Station and recently announced it will open a campus in Dagenham in September 2017.
BART: A university has officially opened its first campus in North Yorkshire.
TRIME: A new university campus has been officially opened in North Yorkshire.
HISTALIGN: A university campus on the Yorkshire coast has opened to the public.

Figure 1: An illustration of HISTALIGN and baseline cache-LM. The input example is from Chang and McCallum (2022). Our HISTALIGN is able to assign high probabilities to both king and woman, and thus is able to tune down the weight of the hallucinated token queen from the softmax probability. Current cache language models (baseline) give high probabilities to irrelevant tokens in the cache and thus are at risk of producing hallucinated or incoherent tokens.

Figure 2: Illustration of our HISTALIGN training approach. We first get the local cache by combining the hidden states in the local context with their target tokens, and then rank them according to embedding similarity. The ranked memories are then used to train with the margin loss. This ensures that negative yet similar words (e.g., accommodations) will be closer in the vector space than irrelevant words (e.g., children).
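The ranking and margin steps named in the Figure 2 caption can be illustrated with a small sketch. This is not the released implementation: the function names, the use of cosine similarity for ranking, the hinge over consecutive ranks, and the way the two losses are combined are all assumptions made for illustration. Only the margin λ, the contrastive weight α, and the value α = 0.5 (used for data-to-text) come from the text.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity, with 0.0 returned for zero vectors."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def rank_memories(target_emb, memory_token_embs):
    """Order cache entries from most to least similar to the target token."""
    return sorted(
        range(len(memory_token_embs)),
        key=lambda i: cosine(target_emb, memory_token_embs[i]),
        reverse=True,
    )


def margin_ranking_loss(query, memory_keys, ranking, margin=0.1):
    """Hinge loss over consecutive ranks (assumed form, not the paper's).

    Asks the query-key similarity to drop by at least `margin` (lambda)
    from each ranked memory to the next one.
    """
    sims = [dot(query, memory_keys[i]) for i in ranking]
    return sum(max(0.0, margin - (hi - lo)) for hi, lo in zip(sims, sims[1:]))


def total_loss(lm_loss, contrastive_loss, alpha=0.5):
    """Assumed combination: LM loss plus alpha-weighted contrastive loss."""
    return lm_loss + alpha * contrastive_loss
```

Under this sketch, a cache whose key similarities already follow the ranking with gaps of at least λ contributes zero contrastive loss, while a cache ordered against the ranking is penalized at every step.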

Figure 3: Human annotation page for evaluating coherence and fluency for prompt continuation.

Figure 5: Sample outputs for open-ended prompt continuation.
St Kilda had the highest Score as an Away Team in the 1928 Vfl Season
BART: Hawthorn had the lowest Away Team Score of any team in the 1928 Vfl Season
TRIME: Geelong had the lowest Score of 8.8 (56) while Hawthorn had the highest Score of 11.13 (79)
HISTALIGN: St Kilda was the Away Team with the highest Score in the 1928 Vfl Season

Figure 8: Sample outputs for data-to-text generation.

Table 1: Results on Ambiguous Template. HISTALIGN achieves the best performance in both full and cache-only settings. We also empirically show that TRIME and HISTALIGN break the softmax bottleneck.

Table 5: Performance on abstractive summarization tasks. HISTALIGN consistently improves faithfulness over the two baseline methods on both datasets.
Prompt: Every time you die, you are transported to the time of the first event that ultimately caused your death. One day, you are hit by a car and sent 12 years into the past. It's been a strange life; constantly setting
GPT2-Large: up and killing other people. But it's done, and now it ' s time to settle down in this new world. I ' d never actually thought about my past life before, I just knew that I went back and lived, but this time there ' s a complication. I ' ve been to many places in my life, so why is there such a big problem? It ' s the weirdest thing, I could go anywhere in time and never get the feeling, but this time it ' s weird. The cars are just like the last time, the ones that hit me, there ' s just a hole in the middle of them. I ' m in a small, abandoned office, and the walls aren ' t quite made out of cement. I think that ' s a nice touch ...
TRIME: up and killing other people. But it's done, and now it's time to put the final touches on this latest creation. I'll be frank, I don't really know what I'm supposed to do. A couple months ago, I was on my way to my parents, and there was an accident on the side of the road. It was some kind of accident; the car was hit by a drunk driver, and it fell on my head. Luckily, I was alive. I was given a few months to live, by the hospital. I can live with that, at least until I can figure out what happened. Anyway, I'm now waiting for the day when I meet my next death, and I have to be prepared for the worst. I can try to avoid death ...

Table for "1893 english cricket season": John Hearne, played in more Match than any other Player, with 20 9
BART: Bill Lockwood and Arthur Mold had the same number of Match
TRIME: Bill Lockwood and Arthur Mold both played 27 Match in the 1893 English Cricket Season
HISTALIGN: John Hearne had the most Match with 29