On-the-Fly Attention Modulation for Neural Generation

Despite considerable advancements with deep neural language models (LMs), neural text generation still suffers from degeneration: the generated text is repetitive, generic, self-contradictory, and often lacks commonsense. Our analyses of sentence-level attention patterns in LMs reveal that neural degeneration may be associated with insufficient learning of task-specific characteristics by the attention mechanism. This finding motivates on-the-fly attention modulation -- a simple but effective method that enables the injection of priors into attention computation during inference. Automatic and human evaluation results on three text generation benchmarks demonstrate that attention modulation helps LMs generate text with enhanced fluency, creativity, and commonsense reasoning, in addition to significantly reducing sentence-level repetition.


Introduction
Neural text generation is critical for a wide range of downstream natural language applications. However, the standard approach -- using a Transformer-based (Vaswani et al., 2017) language model (e.g., Radford et al., 2019) with maximum likelihood fine-tuning and non-stochastic decoding -- is known to exhibit degeneration (Welleck et al., 2019). Despite being pre-trained on large amounts of data, text generated by neural models is observed to be repetitive, generic, self-contradictory, and lacking commonsense.
Many explanations have been proposed for neural text degeneration, including inappropriate training objectives (Welleck et al., 2019) and decoding discrepancies relative to human language (Holtzman et al., 2018). While the aforementioned may be factors for neural degeneration, we show that insufficient learning of task-specific characteristics -- reflected in the self-attention mechanism in Transformer blocks -- is associated with neural text degeneration. We demonstrate that degeneration is alleviated if we inject priors through attention modulation (AttnM) during inference.

Figure 1: Example of fine-tuned GPT2-L outputs without (top) and with (bottom) attention modulation on αNLG. The task is to generate a plausible explanatory hypothesis H for observations O1 and O2. Our proposed attention modulation injects the task-specific prior -- LMs should consider both observations relatively equally -- by balancing the sentence-level attention weights (Eqn. 5) in Transformer blocks during inference. Applying attention modulation with this prior makes the sentence-level attention from the generation to the observation pair (O1, O2) more balanced, as reflected in the sentence-level attention heatmaps of GPT2-L (darker = lower attention) across layers (y-axis) and heads (x-axis).
Self-attention -- the ubiquitous component of Transformers -- is task-agnostic with a large learning capacity for many NLP tasks (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). It learns the general characteristics of language processing through pre-training on large amounts of unlabeled data. For example, multiple analyses have suggested that attention patterns in pre-trained Transformers implicitly encode syntactic information (Raganato and Tiedemann, 2018; Michel et al., 2019; Vig and Belinkov, 2019). In sequence transduction tasks, these learned characteristics, embedded in attention, make pre-trained Transformers a powerful language model (Radford et al., 2019).
A final task-specific step is typically required to adapt a task-agnostic language model to perform the desired task.3 However, these task-specific characteristics might not sufficiently coincide with general characteristics even after fine-tuning. For example, task-specific characteristics embedded in attention patterns -- such as word alignments for machine translation -- are often noisy and imperfect for generalization (Kobayashi et al., 2020).
We show that insufficient learning of task-specific characteristics, reflected in sentence-level attention patterns4 often being out of focus, may be associated with neural text degeneration (§3). Based on this observation, we propose a simple attention modulation framework that can dynamically redistribute sentence-level attention weights by injecting task-specific priors into Transformer blocks for different downstream tasks (§4). Remarkably, on long-range narrative story generation, abductive reasoning generation, and constrained commonsense text generation, both automatic and human evaluation show that attention modulation improves fluency, reduces dullness and repetition, and strengthens commonsense reasoning (§6).

Background
We briefly discuss how vanilla attention works, as well as the Transformer architecture used in this paper.
Single-headed attention. Given a sequence of d-dimensional input vectors x = {x_1, ..., x_n}, the attention mechanism computes a set of weights based on a query vector y_i ∈ R^d:

    \mathrm{Attn}(x, y_i) = (\alpha_{i,1}(x, y_i), \ldots, \alpha_{i,n}(x, y_i))    (1)

3 Brown et al. (2020) have shown that GPT3 greatly improves task-agnostic, few-shot performance, but still struggles on tasks with strong task-specific characteristics.
4 We study the global context in the multi-sentence prompts and choose sentence-level attention (Eqn. 7) as the experiment unit, since sentences are linguistic units of complete meaning.
where α_{i,j} is the attention weight that y_i pays to x_j. One formulation of attention -- scaled dot-product attention -- is computed as:

    \alpha_{i,j}(x, y_i) = \frac{\exp\left(q(y_i)^\top k(x_j) / \sqrt{d}\right)}{\sum_{j'=1}^{n} \exp\left(q(y_i)^\top k(x_{j'}) / \sqrt{d}\right)}    (2)

where the query q(·) and key k(·) functions are linear transformations. In self-attention, every x_i is used as the query vector (y_i). An updated representation x̃_i is computed as a weighted sum of value vectors that are linearly transformed by v(·):

    \tilde{x}_i = \sum_{j=1}^{n} \alpha_{i,j} \, v(x_j)    (3)

Multi-head attention. In multi-headed attention (MHA), N_h attention heads are computed independently to obtain the updated x̃_i:

    \tilde{x}_i = W_o \left[ \tilde{x}_i^1; \ldots; \tilde{x}_i^{N_h} \right], \qquad \tilde{x}_i^h = \sum_{j=1}^{n} \alpha^h_{i,j} \, v^h(x_j)    (4)

α^h_{i,j} follows Eqn. 2, except the model dimension in each head h is often reduced to d_h = d / N_h. x̃_i is obtained by concatenating the lower-rank representations from all heads, with W_o ∈ R^{d × N_h·d_h}.
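For concreteness, the computation in Eqns. 1-3 can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the matrices Wq, Wk, Wv stand in for the linear maps q(·), k(·), v(·), and the causal mask that a decoder LM such as GPT2 applies is omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Eqns. 1-3).

    x: (n, d) input vectors; Wq/Wk/Wv: (d, d) linear maps for q(.), k(.), v(.).
    Returns attention weights alpha (n, n) and updated representations (n, d).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = x.shape[-1]
    # alpha[i, j]: weight that query position i pays to position j (Eqn. 2)
    alpha = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return alpha, alpha @ v  # weighted sum of values (Eqn. 3)
```

Multi-head attention simply runs this computation N_h times with smaller head dimension d_h and concatenates the results.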
GPT2-L. GPT2 (Radford et al., 2019) is a family of Transformer-based language models (LMs) that follows a stacked-decoder architecture. As GPT2 follows a multi-layer, multi-headed setting, α_{i,j} is specific to a layer l and head h, denoted α^{l,h}_{i,j}. We use the GPT2-L model, which has 36 layers with 20 heads per layer (762M total parameters).

Neural text degeneration vs. attention
As researchers have sought to understand the internal mechanisms of Transformers, the attention patterns exhibited by attention heads have drawn considerable study (Vig and Belinkov, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). We perform sentence-level attention analysis to explore whether aggregated attention patterns are associated with neural text degeneration.

Sentence-level attention
We first define the sentence-to-sentence attention of a language model M with L layers and H heads. Given two sentences p and g such that p precedes g, the mean ᾱ^{l,h}_{g,p} and max α̂^{l,h}_{g,p} sentence-to-sentence attentions from g to p for layer l and head h are:

    \bar{\alpha}^{l,h}_{g,p} = \frac{1}{|g| \, |p|} \sum_{x_i \in g} \sum_{x_j \in p} \alpha^{l,h}_{i,j}    (5)

    \hat{\alpha}^{l,h}_{g,p} = \max_{x_i \in g, \, x_j \in p} \alpha^{l,h}_{i,j}    (6)

The aggregated sentence-to-sentence attention over the Transformer architecture M is defined as:

    \alpha^{M}_{g,p} = \frac{1}{L \, H} \sum_{l=1}^{L} \sum_{h=1}^{H} \alpha^{l,h}_{g,p}    (7)

where α ∈ {ᾱ, α̂} computes either the mean or the max sentence-level attention over M.
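The aggregation in Eqns. 5-7 can be sketched in a few lines of NumPy. This assumes the model's token-level attention is available as a single (layers, heads, tokens, tokens) array and that sentence boundaries are given as token index lists; both are assumptions about the surrounding harness, not part of the paper's definitions.

```python
import numpy as np

def sentence_attention(attn, g_idx, p_idx, reduce="mean"):
    """Aggregate token-level attention into sentence-to-sentence attention.

    attn: (L, H, n, n) weights; attn[l, h, i, j] = weight token i pays to token j.
    g_idx, p_idx: token indices of sentences g and p (p precedes g).
    reduce: "mean" (Eqn. 5) or "max" (Eqn. 6) within each layer/head.
    Returns the sentence-level attention averaged over all L*H heads (Eqn. 7).
    """
    block = attn[:, :, g_idx][:, :, :, p_idx]  # (L, H, |g|, |p|) sub-matrix
    if reduce == "mean":
        per_head = block.mean(axis=(2, 3))     # Eqn. 5 per layer/head
    else:
        per_head = block.max(axis=(2, 3))      # Eqn. 6 per layer/head
    return per_head.mean()                     # average over layers and heads
```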

Is neural text degeneration related to attention patterns?
We conduct experiments to evaluate whether neural text degeneration is associated with sentence-level attention patterns. Empirical results on two types of neural degeneration that are easy to detect -repetition and lacking commonsense reasoning under constraints -reveal their association.
Repetition vs. attention One common form of neural text degeneration is sentence-level repetition (Welleck et al., 2019). This type of degeneration happens frequently in our experiment on ROCStories test set ( §6.1): given a five-sentence prompt, 35.4% of the consecutive sentences from the next five greedily generated sentences by the fine-tuned GPT2-L are exact repetitions. We check whether sentence-level attention patterns behave differently when generating repeated or different consecutive sentences.
We inspect the attention behavior by measuring the change of sentence-level attention when generating two consecutive sentences. The generations of the fine-tuned GPT2-L on the ROCStories test set are separated into two subsets {D_repeated, D_different}, in which the consecutive sentences are either repeated (i.e., degenerate) or different. Given the fine-tuned GPT2-L language model M, we measure the sentence-level attention change ∆ on the prompt sentence p_j, j ∈ {1, ..., 5}, while generating the consecutive sentence pair (g_i, g_{i+1}), i ∈ {1, ..., 4}, aggregated over each subset:

    \Delta(p_j, i) = \left| \bar{\alpha}^{M}_{g_{i+1}, p_j} - \bar{\alpha}^{M}_{g_i, p_j} \right|    (8)

where ᾱ^M_{g_i,p_j} is the mean sentence-level attention from sentence g_i to sentence p_j defined in Eqn. 7. Figure 2 plots the aggregated mean sentence-level attention change over prompt sentences when GPT2-L generates repeated (red) or different (blue) consecutive sentences. The sentence-level attention changes are vastly lower on all prompt sentences when generating repeated consecutive sentences. Thus, sentence-level repetition may be correlated with insufficient change of sentence-level attention. In §4 and §6, we show that generation quality can be vastly improved by injecting the prior -- attention should look at the prompt differently when generating different sentences -- through our proposed attention modulation.
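The per-prompt-sentence change quantity can be sketched as follows. This is a hypothetical harness, assuming the same (layers, heads, tokens, tokens) attention layout as above and that separate attention snapshots are captured while generating g_i and g_{i+1}; the exact corpus-level aggregation behind Figure 2 is an assumption.

```python
import numpy as np

def mean_sent_attn(attn, g_idx, p_idx):
    # Mean sentence-level attention (Eqns. 5 and 7), averaged over layers/heads.
    return attn[:, :, g_idx][:, :, :, p_idx].mean()

def attention_change(attn_gi, attn_gnext, g_i_idx, g_next_idx, p_idx):
    """Absolute change in mean sentence-level attention on prompt sentence p
    between generating consecutive sentences g_i and g_{i+1}."""
    return abs(mean_sent_attn(attn_gnext, g_next_idx, p_idx)
               - mean_sent_attn(attn_gi, g_i_idx, p_idx))
```

Averaging this quantity over D_repeated and D_different separately yields the two curves compared in Figure 2.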
Lack of commonsense reasoning vs. attention. Text generated by neural language models is also observed to lack commonsense reasoning (Mao et al., 2019). We check whether this type of neural degeneration is associated with attention patterns. A benchmark dataset for generative commonsense reasoning -- CommonGen (Lin et al., 2020) -- is used as our test bed. CommonGen is designed for constrained commonsense reasoning: given a set of common concepts (e.g., use, tool, piece, metal), the task is to generate a coherent and plausible sentence covering all these concepts (e.g., "a piece of metal is used for making tools"). Covering the concepts in generation requires relational reasoning with background commonsense knowledge. Each concept is represented as a prompt sentence in our experiments. During generation, a concept (e.g., swim) is covered if one of its inflected forms (e.g., {swim, swimming, swam, swum}) is generated. We use a fine-tuned GPT2-L for generation on the CommonGen test set. Among the 5988 concepts in the prompts, about 75% are covered in the generations of GPT2-L. We can then separate the sentence-level attention from the generation to each concept into two subsets: concepts in the prompt that are covered or uncovered by the generated sentence. Table 1 shows the max sentence-level attention (Eqn. 7) of the fine-tuned GPT2-L on the CommonGen test set. We observe that the sentence-level attention from the generation to a concept is vastly higher when the concept is covered: compared to uncovered concepts, the aggregated max sentence-level attention is 15.4% higher. Therefore, failing to generate a common concept through reasoning may be associated with insufficient attention to the concept.
In both cases, neural text degeneration is associated with insufficient attention to elements that are important for downstream generations. This motivates us to explore whether we can inject these priors in the language model by altering the attention mechanism to alleviate degeneration.

Method
This section describes our method -attention modulation -that can alleviate neural text degeneration. In §4.1, we describe the general attention modulation framework. In §4.2, §4.3, and §4.4, we discuss the priors injected through attention modulation for three different tasks: narrative story generation, abductive reasoning generation, and constrained commonsense reasoning.

Attention Modulation
Attention modulation aims to change the attention weights of a Transformer-based language model during inference, so that the generation can reflect priors that alleviate neural text degeneration. This additional signal is added to the self-attention computation in the Transformer blocks.
We reformulate the attention computation of Eqn. 2 by adding an attention reweighting function f, through which priors can be injected. Given a sequence of input tokens x, the self-attention from x_i to x_j (i ≥ j) while generating the t-th token is reformulated by composing the original attention weights with f (Eqn. 9), where f is the attention reweighting function and α^{t−1} is the attention weight matrix for all layers and heads in the Transformer architecture at time step t − 1. The attention reweighting function f can be either pre-defined or learned. In our experiments, we inject pre-defined sentence-level priors (heuristics) through f and show that this injection alleviates neural text degeneration. We leave learning better reweighting functions automatically to future work.
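One minimal way to realize such a reweighting -- a sketch under the assumption of a multiplicative form with renormalization, not necessarily the paper's exact Eqn. 9 -- is to scale a row-stochastic attention matrix by per-token multipliers produced by a prior and renormalize so each row still sums to one:

```python
import numpy as np

def modulate(alpha, weights):
    """Apply per-token reweighting to attention and renormalize.

    alpha: (n, n) row-stochastic attention weights for one layer/head.
    weights: (n,) positive per-token multipliers produced by a prior f.
    Returns modulated weights whose rows still sum to 1.
    """
    scaled = alpha * weights[None, :]            # up-/down-weight key positions
    return scaled / scaled.sum(axis=-1, keepdims=True)
```

Applied inside selected Transformer layers at inference time, this changes which context tokens the model reads without any re-training.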
In the following sections, we describe sentencelevel attention reweighting functions that are used for three different text generation tasks.

ROCStories: narrative generation
As shown in §3, sentence repetition in long-form generation may be associated with insufficient attention change while generating consecutive sentences. To amplify the attention changes, we can redistribute sentence-level attention with some priors while generating consecutive sentences.
We choose the prior that the language model should consider long-range context during generation, as we observed that attention mostly focuses on the near history in many cases (Appendix A.1). Note that this prior also increases the sentence-level attention change while generating consecutive sentences: the sentence-level attention for all previous sentences is always re-balanced based on the newly generated sentence. To balance the attention that tokens in each sentence receive while generating the next sentence, we instantiate the attention reweighting function of Eqn. 9 with the aforementioned prior for ROCStories (Eqn. 10). As later sentences in the prompt usually receive larger sentence-level attention weights (Appendix A.1), the attention reweighting function defined in Eqn. 10 adds a large weight to tokens in the early context sentences and a small weight to tokens in the late context sentences. The simple heuristic of balancing context sentences to be considered relatively equally -- putting more weight on early context sentences -- might not be the optimal prior. However, it improves long-form story generation on multiple measures, including fluency, interestingness, newness, and repetition (§6.1).
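The balancing prior can be sketched as follows. This is a hypothetical implementation, not the paper's exact Eqn. 10: each prompt sentence's attention mass is rescaled toward a uniform share, so under-attended (typically early) sentences are up-weighted.

```python
import numpy as np

def balance_sentence_attention(alpha_row, sent_ids, eps=1e-8):
    """Rebalance one query row of attention so prompt sentences are
    considered more equally.

    alpha_row: (n,) attention the current query pays to context tokens.
    sent_ids: (n,) sentence index of each context token.
    """
    alpha_row = np.asarray(alpha_row, dtype=float)
    sent_ids = np.asarray(sent_ids)
    out = alpha_row.copy()
    sents = np.unique(sent_ids)
    target = 1.0 / len(sents)                # equal mass per prompt sentence
    for s in sents:
        mask = sent_ids == s
        mass = alpha_row[mask].sum()
        out[mask] *= target / max(mass, eps)  # scale sentence mass toward uniform
    return out / out.sum()                    # renormalize the row
```

With this sketch, a sentence that previously received 90% of the attention mass is scaled down while an early sentence with 10% is scaled up, matching the intuition that early context sentences get larger added weight.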

αNLG: abductive reasoning generation
The second benchmark dataset we test with attention modulation is αNLG. This dataset is proposed for abductive reasoning generation: given two observations O1 and O2, the model needs to generate a valid hypothesis h that explains what happened between the two observations. For example, given O1: "Today was the first day of school." and O2: "I hate school.", the task is to generate h such as "The teacher made fun of me." as a plausible explanation. Prior work has shown that the fine-tuned GPT2 performs far below human performance on the αNLG task. We hypothesize that this may be associated with insufficient learning of sentence-level attention to both observations; for example, the model might overfit to one of the observations during generation. Thus, we inject the prior -- the language model should consider both observations relatively equally -- while generating a plausible explanation. This prior can be injected with the attention reweighting function defined in Eqn. 10.

CommonGen: constrained commonsense generation
The third benchmark is CommonGen -- a constrained text generation challenge for generative commonsense reasoning. CommonGen requires machines to generate a realistic sentence that uses all concepts from a given concept set; to produce a plausible and grammatical sentence that follows commonsense, models need to reason over the relations among the given concepts. Our experiment in §3.2 shows that the fine-tuned GPT2-L covers only about 75% of concepts during generation. We infer from Table 1 that this may be associated with GPT2-L giving insufficient sentence-level attention to uncovered concepts. Thus, we propose a simple heuristic -- the model should pay more attention to concepts that are not yet covered -- to be injected with attention modulation.
Consider a prompt with m concepts c = {c_1, ..., c_m} and a partially generated sentence y_1, ..., y_{t−1}. While generating the t-th token, the sentence-level reweighting function from the i-th token to the j-th token in c_k is defined in Eqn. 11. Intuitively, if a concept c_k is covered in the partial generation, attention modulation with Eqn. 11 reduces the attention weights of the tokens in concept c_k.
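One plausible realization of this heuristic -- a sketch, not necessarily the paper's exact Eqn. 11 -- multiplies the attention on tokens of already-covered concepts by a factor β < 1 and renormalizes, which shifts attention mass toward concepts not yet generated:

```python
import numpy as np

def coverage_reweight(alpha_row, token_concept, covered, beta=0.5):
    """Down-weight attention on tokens of concepts already covered.

    alpha_row: (n,) attention over context tokens.
    token_concept: (n,) concept id per token (-1 for non-concept tokens).
    covered: set of concept ids already present in the partial generation.
    beta: multiplier (< 1) applied to covered-concept tokens (an assumption).
    """
    alpha_row = np.asarray(alpha_row, dtype=float)
    scale = np.array([beta if c in covered else 1.0 for c in token_concept])
    out = alpha_row * scale
    return out / out.sum()  # renormalize so relative mass moves to uncovered concepts
```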

Experimental Setups
This section describes the experiment setups, including the baselines, decoding algorithms, datasets, and evaluation metrics.
Model architecture & baseline. Attention modulation is architecture-agnostic and can be applied to any Transformer-based model that contains self-attention computation. We choose GPT2-L (Radford et al., 2019) for our experiments, which has achieved state-of-the-art performance on a variety of generation tasks (Vig and Belinkov, 2019). Attention modulation can be applied to any range of layers in the Transformer. To compare models with and without attention modulation on each of the three generation tasks, we use the best fine-tuned GPT2-L based on the validation set after fine-tuning for 4 epochs with the default settings.
Decoding. Attention modulation directly changes the attention weights of the context tokens during inference. It is orthogonal to decoding algorithms, which change the search strategy based on the softmax distribution emitted by Transformers.7 We present results with non-stochastic decoding algorithms (i.e., greedy decoding and beam search), as generations based on them truly reflect the token-level probabilities predicted by the model (Holtzman et al., 2018).
Evaluation. On ROCStories, we measure dullness, relevancy, and repetition similar to Welleck et al. (2019). We report the number of unique tokens generated, where the generation is less dull if more unique tokens are generated. For repetition, we directly measure sentence-level repetition: two generated sentences are repeated if their strings are the same. For relevancy, we measure the percentage of generated tokens that appear in the prompt. We also perform a human evaluation, where three annotators are asked to rate the generations on fluency, interestingness, newness, relevancy, and repetition. On αNLG, we score the generated explanation with respect to the reference using the following automatic metrics: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and CIDEr (Vedantam et al., 2015). In addition, we ask annotators to compare the explanations generated without and with attention modulation; human judges are asked to decide which system provides a more plausible explanation of the observations. On CommonGen, we report SPICE (Anderson et al., 2016) -- a measure that evaluates semantic propositional content -- in addition to BLEU, ROUGE, METEOR, and CIDEr. We also report Coverage (Lin et al., 2020), which computes the average percentage of input concepts that appear in the lemmatized outputs. We conduct a human evaluation following the protocol of Lu et al. (2020): human judges are asked to compare two systems in terms of fluency, coverage (covers the concepts), and overall quality (covers the concepts and follows commonsense).

7 These search-based decoding algorithms do not resolve the poorly generated token-level probabilities.

Results
In this section, we present the vast improvements of the fine-tuned GPT2-L with attention modulation on three narrative generation and generative reasoning tasks: ROCStories, αNLG, and CommonGen.

On ROCStories, attention modulation reduces degeneration while increasing the relevancy of generated sentences to the original story. We observe a vast improvement in the number of unique generated tokens with attention modulation, indicating a reduced repetition rate (confirmed by the percentage of repeated sentences among the next five generated sentences: 35.43 without vs. 17.49 with our approach). This is corroborated by our human evaluation in Figure 3, where GPT2-L with attention modulation produces sentences that are more fluent, more interesting, more novel, and less repetitive than the original decoder. Furthermore, we note that the difference in performance across these evaluation categories generally increases as the number of generated sentences increases, indicating less sensitivity to long-form degeneration.

Table 4 presents the automatic and human evaluation results on αNLG. Our model performs similarly with and without attention modulation in terms of automatic evaluation. However, the human evaluation results in the last column show that, overall, human judges prefer the explanations produced with attention modulation significantly more often than those of the original model. Over 100 generated samples, human judges select explanations generated with attention modulation as more plausible 33% of the time, whereas explanations from the original model are preferred only 14% of the time.

Table 3 shows the automatic evaluation results on the CommonGen dataset, with different decoding settings separated into blocks. By injecting the prior -- the model should put more attention on uncovered concepts -- into GPT2-L with attention modulation, we significantly improve the generated text on every automatic measure.
Interestingly, although our attention-reweighted decoding only encourages coverage, all the other measures -- ROUGE, BLEU, METEOR, CIDEr, and SPICE -- improve as well. These improvements also hold when we use a different base decoding algorithm, such as beam search. Again, the performance improvement from attention modulation is significant on all measures. Thus, unlike decoding algorithms that improve downstream tasks by truncating the sampling distribution, we directly re-calibrate the token-level probabilities predicted by the model by altering attention patterns in the Transformer blocks during inference.

CommonGen
We also conduct a human evaluation to check whether this improvement in the automatic metrics transfers to human judgments. In Table 6, we see that our attention modulation algorithm significantly outperforms the original inference model on every measure: fluency, quality, and overall performance. The benefit of attention modulation is more prominent when the fine-tuning data size is small. For example, adding attention modulation improves coverage by 9.43% for GPT2-L fine-tuned with only 10 examples. This not only validates that the priors we inject into the model are suitable for improving downstream task performance, but also sheds light on using attention modulation in few-shot learning scenarios where the number of training examples is limited.

Related Work
We propose to use attention modulation to heuristically re-balance sentence-level attention against neural text degeneration. At least three domains of work are closely related to our proposal: attention pattern analysis, work on changing or approximating learned attention patterns, and work on countering neural text degeneration.
Attention analysis: Previous work has investigated attention patterns within the local context of a sentence, highlighting that attention patterns in Transformers implicitly encode syntactic information such as dependency relations (Htut et al., 2019) and part-of-speech tags (Vig and Belinkov, 2019; Raganato and Tiedemann, 2018). Other works observed that attention patterns can provide explanations (Wiegreffe and Pinter, 2019) or coarse word alignments in machine translation (Zenkel et al., 2019; Kobayashi et al., 2020). In contrast to these works, we analyze sentence-level attention patterns in the context of neural text degeneration, and propose to directly modify the attention computation to reduce degeneration.
Alternative attention: Many works have proposed changing attention mechanisms to optimize their O(n^2) complexity. Promising directions in this space include sparse attention mechanisms (Beltagy et al., 2020; Zaheer et al., 2020) and linearized attention (Choromanski et al., 2021). These alternative attention mechanisms require training the model and are used as replacements for the original attention mechanism for fast training or reduced computation. Our work is fundamentally different, as we seek to inject priors into the standard attention mechanism during inference, without re-training the model.
Neural text degeneration: Previous works seek to address neural text degeneration by changing the training objective to reduce the likelihood of common tokens (Welleck et al., 2019), or by modifying the decoding algorithm to truncate the sampling distribution (Holtzman et al., 2018). Specifically, Welleck et al. (2019) introduce an additional training loss that reduces the likelihood of common tokens, while Holtzman et al. (2018) propose stochastic decoding algorithms that truncate the sampling distribution. Our work is orthogonal to these methods, as we inject priors into the model's attention computation during inference.

Conclusions and future work
Neural language models often exhibit degeneration: the output texts are repetitive, bland, and inconsistent. Our empirical analyses show that neural text degeneration may be associated with insufficient learning of task-specific characteristics by the attention mechanism. We propose a simple but effective method -- attention modulation -- that injects priors for better generation by re-balancing the attention weights during inference. Results on three different narrative and commonsense generation tasks indicate that attention modulation can reduce repetition and enhance commonsense reasoning while maintaining fluency and coherence.

A Appendices
A.1 How does a language model use attention to model a multi-sentence prompt?
Sentence-level attention portion. To reveal which part of the context -- near or distant history -- is important for context representation, we compute the aggregated mean sentence-level attention (Eqn. 5 in the main text) that each prompt sentence p_i receives while generating the sentence g_1 after the prompt. We observe from Figure 4 that GPT2-L mostly attends to the nearest sentence (p_5) during generation. This effect is especially prominent in the early and middle layers; in the late layers, the attention over different sentences evens out. This observation is consistent with previous analyses of attention patterns within sentences across layers (Vig and Belinkov, 2019).
Sentence-level attention entropy. Khandelwal et al. (2018) observed that LSTMs represent distant context as topics; only a few tokens in the distant context are used to compute the context representation. We check whether this observation also holds for Transformer-based models by computing attention entropy. The sentence-level attention entropy of p_i, based on the attention from g_1 to p_i at layer l_m over a corpus X, is defined as:

    \mathrm{Entropy}(p_i) = - \frac{\sum_{X} \sum_{h \in H} \sum_{x_j \in p_i} \sum_{x_k \in g_1} \alpha^{h}_{k,j} \log \alpha^{h}_{k,j}}{|X| \cdot |H| \cdot |p_i| \cdot |g_1|}    (12)

where h is a head in layer l_m and α^h_{k,j} is the attention weight from x_k ∈ g_1 to x_j ∈ p_i for head h. Figure 5 shows a clear separation of entropy over different sentences in the prompt, where more distant sentences have lower entropy values. This suggests that LMs model distant sentences only as topics, with attention over key words serving as a proxy.
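The per-example part of Eqn. 12 can be sketched as below (a sketch under the assumption that one layer's attention is given as a (heads, tokens, tokens) array; the corpus-level average over X is left to the caller):

```python
import numpy as np

def sentence_attention_entropy(attn, p_idx, g_idx, eps=1e-12):
    """Per-example sentence-level attention entropy of prompt sentence p_i.

    attn: (H, n, n) weights for one layer; attn[h, k, j] = weight token k pays to token j.
    p_idx, g_idx: token indices of p_i and g_1.
    Averages -alpha * log(alpha) over heads and (g_1, p_i) token pairs.
    """
    block = attn[:, g_idx][:, :, p_idx]                # (H, |g_1|, |p_i|)
    return float(-(block * np.log(block + eps)).mean())
```

Lower values indicate that attention onto p_i is concentrated on a few key tokens, consistent with the "topic" interpretation above.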

A.2 Hyperparameters
Attention modulation can be applied to any layer and any head in the Transformer in our implementation. However, the weights learned by different heads in a particular layer have a large variance (Vig and Belinkov, 2019) and are subject to change across training sessions. Therefore, we only reweight attention on all heads of selected layers, where the set of reweighted layers is a hyperparameter. We reweight the consecutive layers from a starting layer l_s to an end layer l_e and performed a grid search over different layer ranges. For the start layer, we experimented with l_s ∈ {0, 4, 8, 12, 16, 20, 24, 28, 32}; for the end layer, we experimented with l_e ∈ {4, 8, 12, 16, 20, 24, 28, 32, 36}. The reweighted layers are chosen based on validation set performance. On ROCStories, GPT2-L is reweighted with l_s = 8 and l_e = 32; on αNLG, with l_s = 12 and l_e = 32; on CommonGen, with l_s = 24 and l_e = 32.

Table 8: Generations with different attention reweighting initializations.
prompt | attention reweight | generated sentence with attention-decoding
field. stand. look. | (1, 3, 2) | A man stands looking at a sign in a field.
field. stand. look. | (1, 2, 3) | He looks up and sees a group of people standing in the field.
field. stand. look. | (2, 3, 1) | He stands in the middle of the field, looking down at the stands.

A.3 Attention modulation and generation order
CommonGen provides the concept set in a random order, and models need to perform relational commonsense reasoning to find an order of concepts suitable for generating a plausible sentence. We found that attention modulation provides signals for generation order given different reweighting functions (examples in Table 8). In this experiment, we guide the generation order by providing different initialization weights in the reweighting functions, enforcing different attention modulation weights based on the order in which we want the concepts to be generated. For example, row 1 in Table 8 means the concepts (FIELD, STAND, LOOK) are initialized to be reweighted by scales of (1, 3, 2). This interesting finding motivates a permutation experiment on CommonGen. For a k-concept set, we initialize the attention modulation weights based on the permutations of 1 to k (k! permutations in total) and generate k! sentences with attention modulation. We then select the generation that covers the most concepts8 from these k! generations as the output. We call this method "attention modulation with permutation". Table 9 presents the results of attention modulation with permutation. Just by enforcing the order in attention modulation, coverage can be improved by another 10%.

8 If there is a tie, we choose the shorter generation.

Figures 6 and 7 show the Mechanical Turk evaluation templates for ROCStories and CommonGen, respectively. On ROCStories, the inter-annotator agreement is 0.743 and Fleiss' kappa is 0.229, averaged over the five dimensions on 200 × 5 samples. The inter-annotator agreements are 84.6%, 72.2%, 76.1%, 78.5%, and 59.9% for fluency, interestingness, newness, relevance, and repetition. On αNLG, the human evaluation results are 33%, 14%, 10%, 32%, and 11% for preferring outputs with attention modulation, without attention modulation, both, neither, and no agreement, respectively. The inter-annotator agreement is 0.835 and kappa is 0.439 over 100 samples.
On CommonGen, the inter-annotator agreements are 77.94%, 71.64%, and 67.50% for fluency, quality, and overall on 200 samples.

Table 11: Example (greedy decoding) of the fine-tuned GPT2-L without and with attention modulation on story completion. Adding attention modulation during inference significantly reduces sentence-level repetition.
prompt: Jen was always afraid of flying. She needed to fly to Japan for an important meeting. Jen took some Dramamine and boarded her plane. Jen dozed off shortly after the plane took off.
w/o AttnM: Jen woke up to a huge headache. She had to cancel her trip to Japan. Jen was glad she had taken Dramamine. She was able to fly safely. Jen was glad she had taken Dramamine.
w/ AttnM: She woke up to a loud bang and felt sick. She had to sit in the back of the plane. She was so scared she couldn't even get out of her seat. Jen was so embarrassed she never flew again. She was so scared she never flew again.
prompt: Neil had been journeying through Asia. But he contracted malaria on a Thai island, and had to be flown home for treatment.
w/o AttnM: He was so sick that he couldn't go to the hospital. He was so sick that he couldn't go to the hospital. He was so sick that he couldn't go to the hospital. He was so sick that he couldn't go to the hospital. He was so sick that he couldn't go to the hospital.
w/ AttnM: He was very sick and had to be hospitalized. He was in the hospital for a week. He recovered and was released. Neil was very happy to be home. He was able to get better and was able to go back to his home country.

Figure 7: Mechanical Turk template used to evaluate CommonGen generations.