LaMemo: Language Modeling with Look-Ahead Memory

Although Transformers with fully connected self-attention are powerful at modeling long-term dependencies, they struggle to scale to long texts with thousands of words in language modeling. One solution is to equip the model with a recurrence memory. However, existing approaches directly reuse hidden states from the previous segment that encode context in a uni-directional way. As a result, the memory is prevented from dynamically interacting with the current context, which provides up-to-date information for token prediction. To remedy this issue, we propose Look-Ahead Memory (LaMemo), which enhances the recurrence memory by incrementally attending to the right-side tokens and interpolating with the old memory states to maintain long-term information in the history. LaMemo embraces bi-directional attention and segment recurrence with an additional computation overhead only linearly proportional to the memory length. Experiments on widely used language modeling benchmarks demonstrate its superiority over baselines equipped with different types of memory mechanisms.


Introduction
Language modeling is an important task that tests the ability to model long-term dependencies by predicting the current token based on the previous context (Mikolov and Zweig, 2012; Merity et al., 2017). Recently, Transformer-based language models have achieved remarkable performance by enabling direct interaction between long-distance word pairs. However, as the computation overhead grows with the length of the input sequence, Transformers can only process a fixed-length segment at a time. To allow long-term information flow across individual segments, existing approaches augment the model with a recurrence memory that stores hidden states computed in previous time steps (Dai et al., 2019) or their compressions (Rae et al., 2020; Martins et al., 2021) for the target tokens to attend to. One limitation of this approach is that the recurrence memory is only aware of older contexts, since its states were computed to predict the next word from left to right. As a result, distant memory states become outdated and less activated by the current context, as illustrated in Figure 1. When humans read or write a document, they maintain a memory that records important information from the past and often refresh it under the current context to keep it up-to-date.
In this paper, we propose Look-Ahead Memory (LaMemo), where memory states "look ahead" to future time steps by attending to the token representations on their right side to provide up-to-date contextualization. To maintain information from the long-term history, we propose memory interpolation, which takes both past and future tokens into consideration and mimics bi-directional attention. Note that directly applying bi-directional attention to update the memory representations brings a computation overhead quadratic in the memory length, whereas LaMemo attends incrementally and keeps the overhead linear. Our contributions are as follows: (1) We propose LaMemo, which dynamically refreshes the memory with look-ahead attention and memory interpolation. (2) We propose disentangled relative positional encoding, a simple yet effective scheme that disentangles the relative distance and the attention direction and generalizes better when attending to future tokens.
(3) We conduct experiments on standard language modeling benchmarks and demonstrate LaMemo's superiority over various baselines equipped with different types of memory mechanisms, despite some having access to longer contexts. Comprehensive comparisons show the benefits of learning memory representations contextualized with up-to-date information.

Background

Transformer for Language Modeling
A Transformer (Vaswani et al., 2017) is composed of multiple identical blocks, each including a multi-head self-attention (Bahdanau et al., 2015) that calculates pair-wise token interactions and a feed-forward layer for position-wise projection with a non-linear activation. Both modules are followed by residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) to facilitate optimization.
Given the input sequence representations X_τ ∈ R^{N×d} of the current τ-th segment, where N is the target sequence length and d is the hidden state size, they are first mapped into queries Q, keys K, and values V by learned weight matrices to compute self-attention:

Q = X_τ W_q,  K = X_τ W_k,  V = X_τ W_v,  (1)

where W_q, W_k, W_v ∈ R^{d×d} are learnable projection matrices. To perform multi-head self-attention, Q, K, V are further split into H heads. For simplicity, we only consider the case of a single head. In language modeling, the attention map is always added with a causal mask to avoid information leakage from the future when predicting the next token:

C_τ = softmax⁻(Q Kᵀ / √d) V,  (2)

where softmax⁻(·) masks positions j > i in the i-th row of the input matrix with −∞ before taking the softmax. The resulting context representations of all heads are concatenated and then projected to the final outputs O_τ ∈ R^{N×d} with a learnable projection matrix W_o ∈ R^{d×d}. Finally, the self-attention outputs O_τ are added to the input representations X_τ and fed to the following point-wise non-linear transformation, denoted as f(·):

f(x) = LN(x + FFN(x)),  (3)

where LN(·) is the layer normalization and FFN(·) is the feed-forward layer, both applied to each row vector individually. The final output of this Transformer layer is f(O_τ + X_τ).
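As a concrete reference, the single-head causal attention described above (Eqs. 1-2) can be sketched in NumPy. This is a minimal illustration with our own variable names, not the paper's implementation:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention over one segment.
    X: (N, d) input representations; Wq/Wk/Wv: (d, d) projections."""
    N, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) pairwise scores
    # causal mask: position i may not attend to any j > i
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (N, d) context vectors
```

Note that the first position can only attend to itself, so its output row equals its own value vector.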
Outputs of the final layer are projected to the vocabulary to predict Pr(w_t | w_1, ..., w_{t−1}). The joint probability of the whole segment is the product of these conditional factors. The final objective is to maximize the following log-likelihood:

L = Σ_t log Pr(w_t | w_1, ..., w_{t−1}).

Recurrence Memory Mechanism
To enable the Transformer to consider more contextual information from previous segments, Dai et al. (2019) proposed to augment the Transformer with a recurrence memory that stores the hidden states of previous time steps as extended keys and values, as shown in Figure 2. Concretely, consider a memory length of M and memory representations X̃_{τ−1} ∈ R^{M×d}. The extended key and value matrices are obtained by prepending X̃_{τ−1} to X_τ before projection:

K̃_τ = [sg(X̃_{τ−1}); X_τ] W_k,  Ṽ_τ = [sg(X̃_{τ−1}); X_τ] W_v,

where sg(·) stands for stop-gradient, which disables gradient propagation to previous segments, and [·;·] indicates concatenation of hidden states along the length dimension. Extended by the recurrence memory, each query vector can consider contexts even beyond the total attention context length M + N. As illustrated by Dai et al. (2019), the effective context length grows linearly with the number of layers and the attention context length due to layer-wise reuse.
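A minimal sketch of the extended keys and values, under our own naming (`mem` plays the role of the stop-gradient memory sg(X̃_{τ−1}); here it is simply a constant array, so no gradient machinery is needed):

```python
import numpy as np

def extended_kv(mem, X, Wk, Wv):
    """Recurrence-memory sketch: prepend the cached memory to the
    current segment along the length dimension, then project.
    mem: (M, d) cached hidden states; X: (N, d) current segment."""
    ctx = np.concatenate([mem, X], axis=0)   # [sg(mem); X], length M + N
    return ctx @ Wk, ctx @ Wv                # extended keys and values
```

Each of the N queries can now attend over up to M + N positions, subject to the causal mask.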
Another technique essential to the recurrence memory is relative positional encoding. By considering only the relative distance between two tokens when computing the attention score, it avoids the temporal confusion caused by indexing the same position across segments and injects a useful relative bias. Transformer-XL uses the fixed sinusoidal encoding matrix (Vaswani et al., 2017) to provide the relative distance bias and learns global bias terms shared across layers, which can extrapolate to longer contexts with a great reduction of parameters compared to Shaw et al. (2018):

A_{i,j} = x_iᵀ W_qᵀ W_k^E x_j + x_iᵀ W_qᵀ W_k^R R_{i−j} + uᵀ W_k^E x_j + vᵀ W_k^R R_{i−j},  (6)

where R is the sinusoid encoding matrix, u, v are learnable weight vectors governing the global content and position biases, and W_k^E, W_k^R are separate key projection matrices for the content and position, respectively.

Method
In this section, we describe our method in detail with our motivation to learn better representations for the memory.

Look-Ahead Attention
Human language is sequential, with one word following another, but humans usually process information in a non-sequential way and recontextualize certain contents several times. For example, when encountering complicated contents during reading, humans usually first store them temporarily in memory, continue to scan for relevant information, and quite often revisit those old contents to refresh their meaning. This dynamic memory-refreshing mechanism enables us to thoroughly understand a passage under the current context.
Existing recurrence memories, however, lack this dynamic contextualization ability. As the representations in the recurrence memory were previously computed conditioned only on their past, they are unaware of the current context, which provides more relevant information for the current token prediction.
To address this limitation, we propose a look-ahead attention that allows the memory to attend to the contexts on its right. Formally, we reuse the notation X̃_{τ−1} for the representations of the memory.
Consider the i-th position of the memory X̃_{τ−1}: x_i can attend to a position x_j on its right (j > i) without causing information leakage, as long as the refreshed state is only used to predict tokens after position j. Though appealing, this naïve approach requires calculating an M × M attention map, which becomes inefficient and redundant when M is significantly greater than N. In fact, since the target segment moves forward N positions at each iteration, we can devise an incremental computation of the look-ahead attention that only requires the newest N positions on the right as key-value pairs.
The look-ahead attention results computed previously can then be effectively reused and interpolated with the current ones (§3.2). Concretely, we formalize the look-ahead attention as follows:

C←_{τ−1} = softmax⁺(Q_{τ−1} K_τᵀ / √d) V_τ,  (9)

where softmax⁺(·) masks positions j ≤ i in the i-th row of the input matrix with −∞ before the softmax. Q_{τ−1} is obtained by Eq. (1), and the projection matrices of query, key, and value are all shared with the causal attention. We illustrate this in Figure 3, where the look-ahead attention (yellow paths) increases the attention window of each memory state to M tokens on its right.
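Under the same assumptions as the earlier sketch, the look-ahead attention differs from causal attention only in the direction of the mask. A sketch with explicit global position indices (our own helper, not the paper's code):

```python
import numpy as np

def look_ahead_attention(Q_mem, K_new, V_new, mem_pos, new_pos):
    """softmax+ sketch: the memory state at global position i attends
    only to strictly newer positions j > i among the newest tokens.
    mem_pos: (M,) global indices of memory states;
    new_pos: (N,) global indices of the newest tokens."""
    d = Q_mem.shape[1]
    scores = Q_mem @ K_new.T / np.sqrt(d)
    forbid = new_pos[None, :] <= mem_pos[:, None]    # mask j <= i with -inf
    scores = np.where(forbid, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V_new
```

In the incremental setting all of the newest N tokens lie to the right of every memory state, so the mask never removes a whole row.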

Memory Interpolation
To save computations for looking-ahead and effectively reuse the attention results of the past, we propose memory interpolation that smoothly interpolates attention results from both the future and the past to provide bi-directional contextualization.
Recall that in the previous iteration, we calculated the causal context representations C→_{τ−1} of X̃_{τ−1} using Eq. 2, where each row is a linear combination of the weighted token representations of the previous tokens. In Sec. 3.1, we described the look-ahead attention, which enables X̃_{τ−1} to attend to the contexts on their right and computes C←_{τ−1} using Eq. 9. Here, we formulate the memory interpolation between the old representations C→_{τ−1} and the new ones C←_{τ−1}, with a coefficient vector α_{τ−1} ∈ R^M controlling the memorization of the past activations:

C↔_{τ−1} = α_{τ−1} ⊙ C→_{τ−1} + (1 − α_{τ−1}) ⊙ C←_{τ−1},  (10)

where ⊙ scales each row by the corresponding coefficient. The resulting C↔_{τ−1}, which attends to contexts from both directions, is further fed to the non-linear transformation defined in Eq. 3 to update representations in higher layers.

Figure 4: The architecture of LaMemo with look-ahead attention and memory interpolation, which refresh the memory dynamically with both the current contexts and the long-term history.
For α_{τ−1}, we define it to be the sum of the normalized attention weights on the previous tokens when calculating C→_{τ−1} (Eq. 2):

α_{τ−1} = s→_{τ−1} ⊘ (s→_{τ−1} + s←_{τ−1} + ε),  (11)

where s→_{τ−1} is the sum of the unnormalized attention scores of C→_{τ−1}, i.e., the denominator of the softmax in Eq. 2; s←_{τ−1} is similarly the denominator of the softmax in Eq. 9; ⊘ is element-wise division; and ε is a small value that prevents division by zero in practice. Eq. 10 can then be derived into a form that resembles bi-directional attention, with the queries attending to positions on both sides (Appendix A). Figure 4 shows the architecture of LaMemo.
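The interpolation above reduces to mixing the two context matrices by each side's share of the combined softmax denominator. A sketch under our own naming:

```python
import numpy as np

def interpolate_memory(C_past, C_future, s_past, s_future, eps=1e-6):
    """Memory interpolation sketch: alpha_i is the fraction of the total
    (bi-directional) attention mass that falls on the past side.
    C_past / C_future: (M, d) causal and look-ahead context matrices;
    s_past / s_future: (M,) softmax denominators of the two attentions."""
    alpha = s_past / (s_past + s_future + eps)
    return alpha[:, None] * C_past + (1.0 - alpha)[:, None] * C_future
```

When the look-ahead side contributes no mass (s_future ≈ 0), the old memory is kept essentially unchanged; equal mass on both sides yields an even mixture.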
Note the difference between the hidden state reuse in the recurrence memory and our memory interpolation: the former simply reuses static representations to extend the attention context, while we update the memory representations by aggregating the weighted attention sums of the history without recomputing them.

Disentangled Relative Positional Encodings
As the look-ahead attention allows the memory to attend to future tokens on its right, we need a relative positional encoding scheme that generalizes to this setting. We start from the relative positional encoding of Transformer-XL, as described by Eq. 6. When the i-th query vector attends to a position j on its right, the relative distance i − j becomes negative, and one could in principle still index the sinusoid matrix with i − j. However, this approach relies solely on the fixed sinusoid encodings to represent both the relative distance and the attention direction. We argue that disentangling them is more effective in capturing these two types of temporal bias and also mitigates a numerical instability issue. Specifically, we propose to learn two direction-aware global position biases to parameterize the sign, and to query R with the absolute value of the relative distance:

A_{i,j} = x_iᵀ W_qᵀ W_k^E x_j + x_iᵀ W_qᵀ W_k^R R_{|i−j|} + u_{sgn(i−j)}ᵀ W_k^E x_j + v_{sgn(i−j)}ᵀ W_k^R R_{|i−j|},

where u_{sgn(i−j)} and v_{sgn(i−j)} select one of two learnable bias vectors depending on the attention direction. The global positional bias now explicitly separates the contributions of sgn(i − j) and |i − j|, which generalizes better to long distances in both the forward and backward directions.
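A sketch of the disentangled positional term, restricted to the position-dependent global bias for clarity. The sinusoid follows Vaswani et al. (2017); `v_past` and `v_future` are hypothetical stand-ins for the learned direction-aware bias vectors:

```python
import numpy as np

def sinusoid(dist, d):
    """Sinusoidal encoding of a non-negative distance (Vaswani et al.)."""
    w = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
    return np.concatenate([np.sin(w * dist), np.cos(w * dist)])

def directional_bias(i, j, d, v_past, v_future):
    """Disentangled sketch: the sinusoid is queried with |i - j|, and
    sgn(i - j) only selects which learned bias vector is applied."""
    v = v_future if j > i else v_past   # looking ahead vs. looking back
    return float(v @ sinusoid(abs(i - j), d))
```

Two positions at the same absolute distance share the same sinusoid but receive different biases depending on the attention direction.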
To illustrate the numerical instability caused by adapting Eq. 6 to j > i, we derive the variance of the dot product xᵀR_{i−j}, where x is a random vector. We show that the variance oscillates and cannot be properly bounded everywhere when i − j shifts from non-negative to negative. A detailed analysis is presented in Appendix B.

Experiments
We evaluate LaMemo on both word-level and character-level language modeling tasks and compare with existing Transformer baselines augmented with different types of memory.

Datasets and Metrics
For the word-level language modeling task, we consider Wikitext-103 (Merity et al., 2017), the most widely used word-level language modeling benchmark. It contains 103 million training tokens from 28 thousand Wikipedia articles, with an average length of 3.6 thousand tokens per article and a vocabulary size of around 260K. We report perplexity (ppl) on the dev and test sets.
We also evaluate on two character-level language modeling benchmarks, enwik8 and text8 (Mahoney, 2011). Both datasets contain 100 million Wikipedia characters. While enwik8 is unprocessed, text8 is preprocessed by lowercasing and filtering to include only the 26 letters a to z and space. On both datasets, we report bits per character (bpc) on the dev and test sets.

Baselines
To directly compare different types of memory, we consider Transformer-XL and variants with the same model architecture but different memory mechanisms.
Transformer+RPE is the vanilla Transformer (Vaswani et al., 2017) with relative positional encodings and no recurrence memory.

Table 1: Word-level language modeling results on Wikitext-103. We report ppl (perplexity) on the dev and test sets. We also report the number of parameters, the memory size, the external memory size, and the number of FLOPs (floating-point operations) for computing one-step prediction on average.

Implementation Details
We follow the standard architecture of Transformer-XL (Dai et al., 2019), which has different configurations for different tasks. Specifically, on Wikitext-103, we use a 16-layer Transformer with 10 attention heads and head dimension 41, equipped with adaptive embeddings (Baevski and Auli, 2019). We set both the target sequence length and the memory length to 150 for all models, following the setting of Dai et al. (2019). For the Compressive Transformer and ∞-former, we additionally use an external memory of size 150, following the setting of Martins et al. (2021). On the text8 and enwik8 datasets, we use a 12-layer Transformer with 8 heads and head dimension 64. The lengths of the target sequence and the recurrence memory are both set to 512. In the main results, we use an evaluation setting identical to the training phase on all datasets and do not use a longer memory. We use the PyTorch framework (Paszke et al., 2019) and Apex for mixed-precision training.
In practice, we found that calculating the exponentials (§3.2) may lead to numerical overflow in mixed-precision mode, so we compute the logarithm of the exponential sum using the logsumexp and logaddexp operators. Further details of the datasets and the hyperparameter settings are described in Appendix C.
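The overflow workaround described above can be illustrated as follows. This is a toy illustration with made-up scores, using NumPy's logaddexp in place of the PyTorch operators:

```python
import numpy as np

# Unnormalized attention logits for the past and future sides of one
# memory position (made-up values large enough to overflow exp in fp16).
scores_past = np.array([80.0, 81.0, 79.5])
scores_future = np.array([80.5, 82.0])

# Keep both softmax denominators in log space:
log_s_past = np.logaddexp.reduce(scores_past)
log_s_future = np.logaddexp.reduce(scores_future)

# alpha = s_past / (s_past + s_future), computed without ever forming
# exp(80) explicitly at full magnitude:
alpha = np.exp(log_s_past - np.logaddexp(log_s_past, log_s_future))
```

The log-space form produces the same coefficient as the direct ratio of exponential sums, while avoiding intermediate values outside the fp16 range.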

Main Results
We show the results on the word-level language modeling benchmark Wikitext-103 in Table 1. Compared to the compressive memory and the unbounded memory, which take longer contexts into account, LaMemo still achieves lower perplexity. This indicates that the look-ahead memory allows the language model to exploit the recent contexts to gain performance, while simply increasing the context length yields marginal improvement. This is in accordance with previous findings on how language models utilize contexts (Khandelwal et al., 2018; Sun et al., 2021). In terms of parameters, LaMemo has the same number of parameters as Transformer-XL, while the other baselines use additional parameters in CNNs to compress or smooth the hidden states. Lastly, we show the number of FLOPs necessary for computing one-step prediction. ∞-former has the highest number of FLOPs, as it resamples enough points from the continuous signal to update the memory using smoothing techniques. LaMemo also incurs additional computation to re-contextualize the memory under the current context. Note that although the Compressive Transformer has a lower number of FLOPs than LaMemo, it has an external memory that consumes more GPU memory.
We also present the results of character-level language modeling on the text8 and enwik8 datasets in Table 2. We observe trends similar to the word-level benchmark: LaMemo outperforms Transformer-XL by 0.04 bpc on text8 and 0.02 bpc on enwik8 with the same context length. Additionally, we observe that all models exhibit overfitting on text8, which might be caused by the extremely small vocabulary of the dataset.

Ablation Study
We conduct ablation studies on Wikitext-103 to examine the effects of the proposed techniques, i.e., look-ahead attention, memory interpolation, and disentangled relative positional encodings.
We use the same model architecture and the same target and memory lengths as in the main results. We first study three configurations: (1) the full model setting (Full); (2) ablating the memory interpolation module (w/o mem interp), i.e., setting the memorizing coefficient α_{τ−1} = 0; and (3) ablating the look-ahead attention (w/o look-ahead), i.e., only using the causal context representations C→_{τ−1} in each layer. As shown in the first three rows of Table 3, both the memory interpolation and the look-ahead attention are indispensable for achieving the best performance. In particular, cancelling out memory interpolation leads to worse performance, which indicates that the distant past still provides additional information beyond the current context.
The second study examines different positional encoding schemes. We substitute our encodings with the RPE of Transformer-XL (Dai et al., 2019) and run experiments with 3 different random seeds, but all the models fail to converge. We plot the training curves of the two encodings in Figure 8 in Appendix B, where we observe that our disentangled RPE is more stable during training and achieves lower perplexity.

Extrapolating to Longer Contexts
In this section, we extrapolate the models to longer contexts during inference to study the effect of dynamic contextualization on the distant past. We fix the length of the target sequence to 64 and extrapolate the trained models to longer memory lengths 64 × m during inference, where m = 1, ..., 10. We compare the perplexity of LaMemo and Transformer-XL trained on Wikitext-103 when augmented with memories of different lengths. As shown in Figure 5, LaMemo consistently achieves lower perplexity than Transformer-XL when extrapolating to longer contexts, while the performance of both models saturates when m exceeds 7. Additionally, we observe that the perplexity gap between the two models increases when taking longer contexts into account. This demonstrates the effectiveness of dynamically refreshing the distant memory representations under the current context.
Attention Analysis

In this section, we analyze the attention distribution of LaMemo to validate the effectiveness of utilizing bi-directional contexts with look-ahead attention.
We first visualize the memorizing coefficient α, which stands for the portion of the past activations in the current memory representations. As shown in Figure 6, we plot α in different layers as a function of the memory index, averaged over 100 text segments. We observe that in lower layers the memory mainly attends to the past (α ≈ 1.0). We conjecture that long-term bi-directionality is not necessary for low-level representations such as lexical features. In higher layers, the memory substantially utilizes the future contents to refresh the high-level representations, especially for the old memory states with small memory indices.
Next, we visualize the attention weight distribution over the context tokens when predicting each target token in Figure 1. For every token, we take the maximal attention weight in each interval of 5 tokens on its left and scale to a context length of 100. The result indicates that LaMemo learns better memory representations by attending to the right-side tokens, which increases memory utilization when predicting the target token.

Case Study
We present the generated texts of LaMemo and Transformer-XL trained on Wikitext-103 in Appendix D. Both models maintain a memory size of 512, and we seed them with the same context randomly sampled from the test set and generate 256 tokens using top-p sampling (Holtzman et al., 2020) with p = 0.95.

Related Work
The Transformer (Vaswani et al., 2017), with its ability to model pair-wise interactions over the input, has become prevalent for sequence modeling, especially long-sequence processing tasks such as long text generation (Tan et al., 2021; Ji and Huang, 2021), long document QA (Beltagy et al., 2020; Ainslie et al., 2020), language modeling (Dai et al., 2019; Rae et al., 2020), and video processing (Wu et al., 2019). Specifically, language modeling (Merity et al., 2017), which requires processing documents with thousands of tokens, has become a natural testbed for benchmarking this long-term processing ability. However, due to the quadratic time and space complexity of self-attention, scaling to inputs with thousands of tokens is computationally prohibitive.
One line of work investigates linear-time attention mechanisms to mitigate the scalability issue of the Transformer. Linformer (Wang et al., 2020) projects the inputs to a lower dimension along the length axis and approximates the full attention with a low-rank factorization. Linear Transformer (Katharopoulos et al., 2020) regards self-attention as a kernel function and uses a linear dot-product as a substitute. Choromanski et al. (2021) and Peng et al. (2021) proposed to approximate the softmax more precisely with the expectation of the dot-product of random features. Although these methods achieve substantial improvements on benchmarks designed for long inputs (Tay et al., 2021), they approximate the full attention with low-rank factorizations or kernel functions, which compromises the expressiveness and robustness of the original softmax attention, and they are reported to be inferior to simple local attention on real-world language processing tasks (Xiong et al., 2021).
Our work falls in another line, which augments the Transformer with a parametrized memory to store critical history information. Memory-augmented networks (Graves et al., 2014; Weston et al., 2015; Sukhbaatar et al., 2015) have long been studied in the context of recurrent neural networks, but mostly on small and synthetic datasets. With the rapid development of the Transformer, various works began to adapt memories to this architecture. Dai et al. (2019) first extended the Transformer with a recurrence memory that caches hidden states computed in previous steps for the target tokens to attend to. Rae et al. (2020) further extended the context with an external memory that stores compressed hidden states at the temporal level. Martins et al. (2021) used continuous-space attention to attend over the old history and updated the memory with recent hidden states to enable unbounded memory capacity. Wu et al. (2021) proposed to use the encoder-decoder architecture to encode the memory states with previous text segments and pass this memory to future time steps. Instead of using a fixed attention span for all layers, Sukhbaatar et al. (2019) and Correia et al. (2019) proposed to learn dynamic attention spans for different attention heads, which greatly reduces the computation. These works focused on enabling the Transformer to access contents at long distances, but did not consider learning better memory representations by refreshing the old memory under the current context. Our work is orthogonal to learning adaptive attention spans and can be combined with this technique to reduce the complexity.

Conclusion
We present LaMemo, a memory mechanism that allows the memory states to incrementally attend to the right-side tokens and interpolates with the old memory states on the left side, which enables the memory to interact with bi-directional contexts with a complexity linear in memory length.Experiments on three language modeling datasets demonstrate the superiority of LaMemo over baselines with various types of memory mechanisms.We also found that LaMemo increases the utilization of older memory states when predicting the target tokens, and yields a higher performance boost when extrapolating to longer memory length, which indicates the effectiveness of recontextualizing the memory under the current context.

A Derivation of Memory Interpolation
We derive Eq. 10 into the form of standard self-attention. Consider the i-th row of C↔_{τ−1}, denoted as c↔_i. We omit the stop-gradient operation sg(·) and substitute α with the result from Eq. 11:

c↔_i = (s→_i / (s→_i + s←_i)) c→_i + (s←_i / (s→_i + s←_i)) c←_i,

where s→_i and s←_i are the denominators of the softmax when computing c→_i and c←_i, respectively:

s→_i = Σ_{j≤i} sim(q̃_i, k̃_j),  s←_i = Σ_{j>i} sim(q_i, k_j),

where (q̃_i, k̃_j) and (q_i, k_j) are the query-key vectors computed in the previous and the current text segment, respectively, for the same position pair (i, j), and sim(·,·) denotes the exponentiated attention score. Then we have:

c↔_i = Σ_{j≤i} (sim(q̃_i, k̃_j) / Z) ṽ_j + Σ_{j>i} (sim(q_i, k_j) / Z) v_j,  Z = Σ_{j≤i} sim(q̃_i, k̃_j) + Σ_{j>i} sim(q_i, k_j),

so the weights β_j = sim(·,·)/Z satisfy Σ_j β_j = 1. Finally, c↔_i is the weighted sum of the value vectors from both the past (j ≤ i) and the future (j > i) of position i.
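The equivalence above can be verified numerically. This is a toy check with our own variable names; for simplicity we use the same query/key vectors for both segments (q̃ = q, k̃ = k), under which the equivalence is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4
q = rng.normal(size=d)        # query of one memory position i
K = rng.normal(size=(L, d))   # keys of all positions
V = rng.normal(size=(L, d))   # values of all positions
i = 2                          # split point: past j <= i, future j > i

sim = np.exp(K @ q)                                 # unnormalized scores
s_past, s_future = sim[: i + 1].sum(), sim[i + 1 :].sum()
c_past = (sim[: i + 1] @ V[: i + 1]) / s_past       # causal attention row
c_future = (sim[i + 1 :] @ V[i + 1 :]) / s_future   # look-ahead attention row

alpha = s_past / (s_past + s_future)
c_bi = alpha * c_past + (1 - alpha) * c_future      # interpolated result
full = (sim @ V) / sim.sum()                        # one softmax over all j
```

The interpolated row `c_bi` coincides with `full`, the result of a single softmax over both sides.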

B Instability Analysis of the RPE in Transformer-XL
We conjecture that the instability of Eq. 6 stems from the terms involving the dot product of R_{i−j} and another vector. We thus consider the variance of xᵀR_{i−j}, where x ∈ R^d is a random vector. Without loss of generality, we assume that x has zero mean and a variance of σ. Following Vaswani et al. (2017), R_Δ takes the form:

R_Δ = [sin(ω_1 Δ), ..., sin(ω_{d/2} Δ), cos(ω_1 Δ), ..., cos(ω_{d/2} Δ)]ᵀ,

where ω_k = 10000^{−2k/d}. The dot product xᵀR_Δ can then be written as a linear combination of sine and cosine functions:

xᵀR_Δ = Σ_{k=1}^{d/2} (x_k sin(ω_k Δ) + x_{k+d/2} cos(ω_k Δ)),

from which we easily derive that E(xᵀR_Δ) = 0.
According to the variance-expectation formula Var(z) = E(z²) − E(z)², we can simplify Var(xᵀR_Δ). Assuming that all elements of x have the same variance σ_s and that all pairs of distinct elements have the same covariance σ_c, we obtain:

Var(xᵀR_Δ) = (d/2) σ_s + σ_c g(Δ),

where g(x) = Σ_{k=1}^{d/2} Σ_{l=1, l≠k}^{d/2} sin(ω_k x) cos(ω_l x) is an odd function.
We now consider the value of g(x) when x ≈ 0.

Figure 7: The plot of g(x) when d = 64. We see that g(x) is symmetric with respect to the origin, and the values of g(x) as x approaches zero from the left and from the right diverge greatly.
Since sin(ω_k x) ≈ ω_k x and cos(ω_k x) ≈ 1 near zero, we have g(x) ≈ γ_d x, where γ_d = Σ_{k=1}^{d/2} Σ_{l≠k} ω_k. Since aˣ ≈ 1 + x ln a when x ≈ 0, we can derive that γ_d ≈ d² / (2 ln 10⁸) as d grows. This gives g(x) a very steep slope near 0. Since g(x) is an odd function, g(Δ) and g(−Δ) have a huge gap for small positive Δ. To validate this, we plot g(x) for d = 64 in Figure 7.
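The steep slope can be checked numerically. A small sketch of our own, with d = 64 as in Figure 7 (we index the sinusoid frequencies from 0, following the common convention; the constant in the slope estimate is only approximate):

```python
import numpy as np

def g(x, d=64):
    """g(x) = sum over k != l of sin(w_k x) * cos(w_l x)."""
    w = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
    s, c = np.sin(w * x), np.cos(w * x)
    return s.sum() * c.sum() - (s * c).sum()   # drop the k == l terms

slope = (g(1e-4) - g(-1e-4)) / 2e-4            # numerical estimate of g'(0)
gamma = 64.0 ** 2 / (2.0 * np.log(1e8))        # predicted d^2 / (2 ln 10^8)
```

The numerical slope is large and roughly matches the predicted γ_d, while g remains exactly odd.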
Overall, the variance of xᵀR_Δ is composed of two terms: the first is σ_s multiplied by the constant factor d/2, and the second is σ_c multiplied by g(Δ). Note that σ_s is strictly positive, while σ_c has no such restriction. Due to the asymptotic behavior of g(Δ) near 0, i.e., O(d²Δ), we cannot find a proper σ_c that keeps Var(xᵀR_Δ) bounded by O(dσ_s) for every Δ taking values over both the positive and negative integers.
Finally, we plot the training curves of the two models using the RPE of Transformer-XL (xl-rpe) and our disentangled RPE (dis-rpe) in Figure 8, where we observe that xl-rpe suffers from numerical instability during training.

C.1 Dataset Details

The text8 dataset contains the first 100 million bytes of the clean text of Wikipedia, retaining only regular articles and image captions. All letters are converted to lower case, and only the 27-character alphabet of letters a-z and non-consecutive spaces is preserved. This dataset is licensed under the CC BY-SA License.
The statistics of the three datasets are shown in Table 4.

C.2 Model Configurations
We follow the base model configuration of Dai et al. (2019). On Wikitext-103, we use a Transformer with 16 layers and 10 attention heads with a head dimension of 41. The inner dimension of the feed-forward layer is 2100. We use a dropout rate of 0.1 and no attention dropout. To cope with the large vocabulary, we use adaptive embeddings (Baevski and Auli, 2019). We set both the memory length and the target sequence length to 150. On the text8 and enwik8 datasets, we use a Transformer with 12 layers and 8 attention heads with a head dimension of 64. The inner dimension of the feed-forward layer is 2048. We use a dropout rate of 0.1 and no attention dropout. We set both the memory length and the target length to 512. Our LaMemo uses the disentangled relative positional encodings described in Sec. 3.3. The look-ahead attention shares the query, key and value projection matrices with the causal attention.

C.3 Training Settings
We trained the models using the Adam (Kingma and Ba, 2015) optimizer with no warmup. We used a learning rate of 2.5 × 10⁻⁴, decayed to 0 at the end of training with a cosine schedule. On Wikitext-103, we trained the model for 250K steps with a batch size of 64. On enwik8 and text8, we trained the model for 100K steps with a batch size of 40. We conducted our experiments on 2 Tesla V100 GPUs.

C.4 Hyperparameters
We present the hyperparameter search space in Table 5.The number of hyperparameter search trials was 10.We adopted a manual search to select the hyperparameters, and the selection criterion was ppl/bpc on the dev set.We did not use early stopping during training.

D Generated Examples
In this section, we present examples generated by LaMemo and Transformer-XL trained on the Wikitext-103 dataset. Both models maintain a memory with a length of 512 (note that we used a smaller number of training steps than Dai et al. (2019)). We randomly select a piece of text from the test set as the context and allow both models to generate 256 tokens following it. We use top-p sampling with p = 0.95 and detokenize the context and the generated texts to facilitate reading. We present the examples in Tables 6 and 7. Our major findings are as follows:

• Both models are able to hallucinate imaginary contents fairly relevant to the limited contexts given as prompts.
• Transformer-XL sometimes generates topic-irrelevant contents without further elaboration (marked by underline), while LaMemo stays on topic more closely during the course of generation.
• Transformer-XL suffers from more severe repetition issues (marked in boldface) than LaMemo, both lexically and semantically.
Context: = Shackleton ( crater ) = Shackleton is an impact crater that lies at the south pole of the Moon.The peaks along the crater's rim are exposed to almost continual sunlight, while the interior is perpetually in shadow (a Crater of eternal darkness).The low-temperature interior of this crater functions as a cold trap that may capture and freeze volatiles shed during comet impacts on the Moon.Measurements by the Lunar Prospector spacecraft showed higher than normal amounts of hydrogen within the crater, which may indicate the presence of water ice.The crater is named after Antarctic explorer Ernest Shackleton.= = Description = = The rotational axis of the Moon lies within Shackleton, only a few kilometers from its center.The crater is 21 km in diameter and 4.2 km deep.From the Earth, it is viewed edge-on in a region of rough, cratered terrain.It is located within the South Pole-Aitken basin on a massif.The rim is slightly raised about the surrounding surface and it has an outer rampart that has been only lightly impacted.No significant craters intersect the rim, and it is sloped about 1.5 °toward the direction 50 -90 °from the Earth.The age of the crater is about 3.6 billion years and it has been in the proximity of the south lunar pole for at least the last two billion years.
Because the orbit of the Moon is tilted only 5° from the ecliptic, the interior of this crater lies in perpetual darkness. Estimates of the area in permanent shadow were obtained from Earth-based radar studies. Peaks along the rim of the crater are almost continually illuminated by sunlight, spending about 80-90% of each lunar orbit exposed to the Sun. Continuously illuminated mountains have been termed peaks of eternal light and have been predicted to exist since the 1900s.
The shadowed portion of the crater was imaged with the Terrain Camera of the Japanese SELENE spacecraft using the illumination of sunlight reflected off the rim. The interior of the crater consists of a symmetrical 30° slope that leads down to a 6.6 km diameter floor. The handful of craters along the interior span no more than a few
LaMemo Generation: kilometres of plains. A slight rotation is observed in the crater that now bears significant elements like snow, ice, and water ice. In other areas of the crater there are indications that the nature of the alpine ice layer provides some protection to the craters by increasing their brightness. The craters seem to be cut by an ice thickness of 10 cm (4 in). Images show their prominence, varying from a visible plateau to a multi-ring topography, indicating that they are present (a major component) in the craters. Their interior has a considerable surface area of dry ice. As seen from Earth, the initial craters are marked by slow-moving metal-bearing mudstone-covered rock that has a surface of substantial amounts of coal, calcium, lead and sodium alloy, carbon monoxide, nitrogen oxides, and silicon dioxide. All are developed by processes such as heat transfer by the advancing radiation pressure. The least dense feature along the surface of the Moon is the large resurfacing event. There is no view that these rocks remain melting to about 90 °C (150 °F), but that their surface shine is dark and their outer layers are light, reflecting vertical oil. This late flank and early 1970s discovery was announced in 1982, as a substantial amount of ice.

A parallel lava
Transformer-XL Generation: kilometres.
During the hundreds of years after its formation, the crater now has significant seismic activity in the vicinity. Most craters are dome-shaped and are usually constructed of sedimentary materials that have been deposited during alpine ice times. The upper surface of the crater is covered with higher layers of rock that are cut by an extensive thickness of debris. The debris reaches the top of a mountain and comes in contact with visible ground planes. People are often observed wearing headgear of degrading materials such as clothing and boots, their shoes or hats, or even working on the surface. Below the crater. As the crater faces the crater it has thick, thin pipes or scarps. A total of more than 200 caves have been excavated, down to some 40 m by 20 m. This exceeding the margin of the crater where it actually passes through is considered to be very high. Other geologic features by the advancing magnetic field have been reported from the crater. However, in 1992, scientists announced they would study this area again. The crater was once a common feature of the Post Lunar System. Its medieval boundaries were not fixed in the orbital plane of Mercury. An individual crater had been called " Discovery crater " and one referred to as " Bear crater ", although it is likely that an additional crater was called

Figure 1: Attention weights over the context (in log-scale) in the final layer of Transformer-XL and LaMemo, averaged over 15K tokens. Transformer-XL quickly loses attention to older contexts, while LaMemo maintains awareness of the history as the context length grows.

Figure 2: The architecture of Transformer-XL augmented with a recurrence memory.

Figure 3: Illustration of LaMemo with a memory length M = 2 and a target sequence length N = 1 for clarity. Solid lines denote the attention connections computed at the current iteration, while dashed lines denote previously computed attention.
As defined by Vaswani et al. (2017), R_∆ ∈ R^D is composed of sine and cosine functions with different frequencies. Since the sine function is odd, sin(−ω∆) = −sin(ω∆), we have R_{−∆} ≠ R_∆, so the encoding can distinguish attention in different directions (the ± sign of ∆) at the same relative distance (the absolute value of ∆).
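The direction-sensitivity of the sinusoidal encoding can be checked numerically. The sketch below assumes the standard interleaved sin/cos layout of Vaswani et al. (2017): the sine dimensions flip sign with the direction of ∆ while the cosine dimensions do not, so encodings of ±∆ differ.

```python
import numpy as np

def rel_pos_encoding(delta, d_model=8):
    """Sinusoidal encoding R_delta of a (signed) relative distance delta:
    even dimensions use sin, odd dimensions use cos (Vaswani et al., 2017)."""
    i = np.arange(d_model // 2)
    omega = 1.0 / (10000 ** (2 * i / d_model))  # per-dimension frequencies
    enc = np.empty(d_model)
    enc[0::2] = np.sin(omega * delta)
    enc[1::2] = np.cos(omega * delta)
    return enc

R_pos, R_neg = rel_pos_encoding(3), rel_pos_encoding(-3)
assert np.allclose(R_neg[0::2], -R_pos[0::2])  # sine dims flip with direction
assert np.allclose(R_neg[1::2],  R_pos[1::2])  # cosine dims are even in delta
assert not np.allclose(R_pos, R_neg)           # hence R_{-delta} != R_{delta}
```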

Figure 5: Test perplexity of LaMemo and Transformer-XL when extrapolating to longer contexts during inference, where m is the ratio of the memory length to the target length.

Figure 6: The memorizing coefficient α in different layers of a 16-layer model with memory and target lengths both set to 150. A smaller index means an older memory slot.
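As a rough illustration of the role of α: LaMemo interpolates old memory states with look-ahead-refreshed ones, and a coefficient like α controls how much old information each slot retains. The sketch below is an assumption-laden simplification (the names `old_mem` and `refreshed_mem`, and the exact weighting, are illustrative and not the paper's parameterization):

```python
import numpy as np

def update_memory(old_mem, refreshed_mem, alpha):
    """Per-slot interpolation between the old memory states and their
    refreshed (look-ahead) counterparts; alpha in [0, 1] is an illustrative
    "memorizing coefficient": larger alpha keeps more old information."""
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * old_mem + (1.0 - alpha) * refreshed_mem
```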

Figure 8: Comparison of the training dynamics using different encoding schemes: the disentangled RPE (disrpe) and the RPE of Transformer-XL (xl-rpe).
Transformer (Rae et al., 2020) uses the positional encodings from Dai et al. (2019) but does not extend the context with additional memory. Transformer-XL (Dai et al., 2019) is a Transformer equipped with relative positional encodings and a recurrence memory comprised of hidden states computed in previous time steps, which extends the context length of the attention. Compressive Transformer (Rae et al., 2020) extends Transformer-XL with an external compressive memory that stores hidden states compressed at the temporal level using convolutional networks. ∞-former (Martins et al., 2021) uses continuous-space attention over an external memory consisting of continuous signals; it also updates the external memory with recent hidden states to enable unbounded memory capacity.
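A minimal sketch of the segment-level recurrence used by Transformer-XL: the previous segment's hidden states are cached (without gradient) and prepended to the current segment as extra attention context, then the cache is rolled forward. Function and variable names here are illustrative, not from a specific codebase:

```python
import numpy as np

def extended_context(memory, segment, mem_len):
    """Segment-level recurrence: prepend cached hidden states of the previous
    segment (treated as constants, i.e. no gradient flows into them) to the
    current segment's hidden states, then keep the most recent `mem_len`
    states as the new memory."""
    context = np.concatenate([memory, segment], axis=0)  # [M + N, D]
    new_memory = context[-mem_len:]                      # roll cache forward
    return context, new_memory
```

The current segment attends to all of `context` as keys/values, which is how information flows across segment boundaries without recomputing the past.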

Table 2: Character-level language modeling results on text8 and enwik8. We report bpc (bits-per-character) on the dev and test sets.
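As a reminder of the metric, bpc is the model's average negative log-likelihood per character expressed in bits. A minimal conversion helper (assuming the loss is accumulated in nats, as is typical):

```python
import math

def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) over a character
    stream into bits-per-character: bpc = NLL / (num_chars * ln 2)."""
    return total_nll_nats / (num_chars * math.log(2))
```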

Table 4: Datasets used in the experiments. For Wikitext-103, we use the official split from Merity et al. (2017) and present the number of tokens in each split. For enwik8 and text8, we use the split from Dai et al. (2019) and report the number of characters for each split.

Table 5: Hyperparameter search space. choice indicates that each of the listed values is chosen with equal probability. Best-found hyperparameters are in boldface.