ABC: Attention with Bounded-memory Control

Transformer architectures have achieved state-of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence length, making the computational overhead prohibitive, especially for long sequences. The attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and that they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem distinct. Second, the abstraction gives new insights: an established approach (Wang et al., 2020b) previously thought not to be applicable in causal attention actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to strong transformer baselines, it significantly improves inference time and space efficiency with no or negligible accuracy loss.


Introduction
Transformer architectures are now central in natural language processing (Vaswani et al., 2017). They rely on the attention mechanism (Bahdanau et al., 2015) to contextualize the input. The context can be seen as a random-access memory whose size grows linearly with the sequence length; each query reads from it using a softmax-normalized linear combination, with overhead linear in the memory size. This amounts to a quadratic complexity overall, making transformers' computational overhead prohibitive, especially for long sequences.

* This work was done while Zhaofeng Wu and Nikolaos Pappas were at the University of Washington.
One way to improve attention's efficiency is to bound its memory size. Imposing a constant-sized constraint over the memory ensures that reading from it has constant time and space overhead, yielding an overall complexity linear in the sequence length. This is in fact a common strategy adopted by several recent works. In this work, we show that some of these works are closely connected in ways that, to date, have gone unremarked. We propose attention with bounded-memory control (ABC), a unified abstraction over them. In ABC, constant-sized memories are organized with various control strategies, e.g., induced from heuristic patterns (Beltagy et al., 2020; Zaheer et al., 2020; Ainslie et al., 2020; Rae et al., 2020, inter alia), locality assumptions (Parmar et al., 2018; Liu et al., 2018), or positions (Wang et al., 2020b). These strategies are, by and large, "context-agnostic." In response, we propose ABC MLP , a particular instance of ABC that learns a contextualized control strategy from data. Specifically, ABC MLP uses a neural network to determine how to store each token into the memory (if at all). Compared to previous bounded-memory models, it strikes a better trade-off between accuracy and efficiency: controlling for accuracy, ABC MLP can get away with much smaller memory sizes.
ABC models (including ABC MLP ) come with a linear complexity in sequence length and admit recurrent computation graphs in causal attention (self-attention over the prefix). They are therefore appealing choices in a variety of applications, including text encoding, language modeling, and text generation. This leads to a surprising finding: Linformer (Wang et al., 2020b), an established efficient attention method, was previously thought not to be applicable in causal attention or autoregressive decoding (Tay et al., 2020). Through the ABC view, we show that it actually is, and that it achieves competitive performance in our machine translation experiments.
ABC connects existing models that would otherwise seem distinct, reveals new insights into established methods, and inspires new efficient attention architectures. We explore its applications in transformers, as a drop-in substitute for the canonical softmax attention. ABC offers a novel lens that can help future research in the analysis of transformers, where the theoretical insights are still catching up with empirical success. Experiments on language modeling, machine translation, and masked language model finetuning show that our ABC MLP model outperforms previous ABC approaches in accuracy with a much smaller memory size. Compared to the strong transformer baseline, ABC MLP achieves a significant speedup and memory savings at inference time, with no or negligible accuracy loss. The efficiency improvements are more prominent for long sequences, suggesting that the asymptotic savings are even more appealing in applications involving long sequences. We release our code at https://github.com/Noahs-ARK/ABC.

An Outer-Product View of Attention
This section presents our outer-product memory perspective of attention, which allows for a smooth transition to later discussion.
In attention, a sequence of queries {q_i}_{i=1}^N attends to a memory with N slots, each storing a key and value pair: K = [k_1, ..., k_N]^⊤, V = [v_1, ..., v_N]^⊤ ∈ R^{N×d}.[1] Query q reads from the memory using a softmax-normalized linear combination, producing a d-dimensional vector:

    attn(q, K, V) = V^⊤ softmax(K q).    (1)

This takes O(N) time and space. When the attention with N queries can be parallelized (e.g., in text encoding), it takes linear time and quadratic space; when it cannot be (e.g., in decoding), it takes quadratic time and linear space.

The memory can be equivalently represented as sums of vector outer products:

    K = I K = Σ_{i=1}^N e_i ⊗ k_i,   V = I V = Σ_{i=1}^N e_i ⊗ v_i,    (2)

where I is the identity matrix and ⊗ denotes the outer product: [x ⊗ y]_{i,j} = x_i y_j. The N-dimensional vectors {e_i} form the standard basis: e_i has its ith element equal to one and all others zero. We can view the e_i as control vectors that determine where to store k_i and v_i: the N-by-d matrix e_i ⊗ k_i has its ith row equal to k_i^⊤ and all other rows zero; in this sense, e_i stores k_i in the ith slot without affecting the others.

[1] The number of queries and key-value pairs may differ, e.g., in the cross attention of a sequence-to-sequence model.
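The outer-product view above can be checked numerically. Below is a small illustrative sketch (our own, not the authors' code; names and shapes are ours) showing that summing e_i ⊗ k_i recovers the plain stack of keys, and that a query reads from the memory with a softmax-normalized linear combination.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4
keys = rng.normal(size=(N, d))
values = rng.normal(size=(N, d))

# Build K via outer products with standard-basis control vectors e_i.
K = np.zeros((N, d))
for i in range(N):
    e_i = np.zeros(N)
    e_i[i] = 1.0
    K += np.outer(e_i, keys[i])
assert np.allclose(K, keys)  # identical to plain stacking

def attend(q, K, V):
    """Softmax read: attn(q, K, V) = V^T softmax(K q)."""
    scores = K @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return V.T @ w

q = rng.normal(size=d)
out = attend(q, K, values)
assert out.shape == (d,)
```

The point of the outer-product form is only a change of perspective: e_i decides *where* k_i is stored, which is exactly the knob ABC later generalizes.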

Attention with Bounded Memory
A straightforward way to improve attention's efficiency is to bound its memory size. Our outer-product view of attention provides a straightforward way to devise this: replace {e_i} with control vectors that select n ≪ N vectors to attend to. We dub this approach attention with bounded-memory control (ABC). Concretely, let K~, V~ ∈ R^{n×d} denote a constant-size memory with n slots, with n set a priori, and let {ϕ_i ∈ R^n}_{i=1}^N denote a sequence of control vectors:

    K~ = Σ_{i=1}^N ϕ_i ⊗ k_i,   V~ = Σ_{i=1}^N ϕ_i ⊗ v_i.    (3)

The output is calculated by attending to K~ and V~:

    attn(q, K~, V~) = V~^⊤ softmax(K~ q).    (4)

We will discuss various ways to construct {ϕ_i} in the subsequent sections. Reading from the memory takes constant O(n) time and space; therefore ABC's overall complexity is O(Nn), linear in the sequence length.[2] Eq. 3 admits an equivalent recurrent computation, which is particularly useful in causal attention, where only the prefix is looked at:

    K~_t = K~_{t-1} + ϕ_t ⊗ k_t,    (5)

and likewise for V~_t. K~_t and V~_t can be seen as a recurrent hidden state that encodes the prefix.
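The equivalence between the batch sum and the recurrence can be sketched as follows (our illustration; the control vectors here are arbitrary random values, standing in for any ABC strategy). It verifies that K~_t = K~_{t-1} + ϕ_t ⊗ k_t reproduces Σ_i ϕ_i ⊗ k_i over the full sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, d = 8, 3, 4
keys = rng.normal(size=(N, d))
phis = rng.random(size=(N, n))  # arbitrary illustrative control vectors

# Batch form: K~ = sum_i phi_i (x) k_i.
K_batch = np.einsum('in,id->nd', phis, keys)

# Recurrent form, useful for causal attention over the prefix:
# K~_t = K~_{t-1} + phi_t (x) k_t.
K_t = np.zeros((n, d))
for t in range(N):
    K_t = K_t + np.outer(phis[t], keys[t])

assert np.allclose(K_t, K_batch)
```

Note the memory stays n-by-d throughout, so each read costs O(n) regardless of the sequence length.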
In what follows, we study several existing efficient attention approaches and show that they are in fact instances of the ABC abstraction.

Linformer
Linformer (Wang et al., 2020b) is an established efficient transformer variant that has proven successful in masked language modeling and text encoding. It assumes fixed-length inputs and learns a low-rank approximation of the attention weights. A learned n-by-N matrix W^LF down-projects the N-by-d keys and values along the timestep dimension to an n-by-d memory: K^LF = W^LF K, V^LF = W^LF V; these are then used for attention computation with Eq. 4. This yields a linear complexity in the input length. Linformer is an ABC instance with ϕ^LF_i = W^LF_{:,i} (the ith column of W^LF); in this sense, it learns a control vector for each position.
Previous works have noted that Linformer cannot be efficiently applied in causal attention (Table 1 of Tay et al., 2020). Indeed, it is less straightforward to avoid mixing the future with the past when projecting along the timestep dimension. ABC reveals that, in fact, Linformer is applicable in causal attention. Like all ABC models, it admits a linear-complexity recurrent computation (Eq. 5):

    K^LF_t = K^LF_{t-1} + W^LF_{:,t} ⊗ k_t,

and likewise for V^LF_t. This confirms ABC's benefits: it reveals new insights about existing models and reassesses their applications and impact. Our experiments show that Linformer achieves competitive performance in machine translation.
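A small sketch of this observation (our reading, not Linformer's released code; W here is random rather than learned): with ϕ_i = W^LF_{:,i}, the causal memory at step t can be maintained recurrently and, at every step, equals projecting only the prefix, so no future information leaks in.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, d = 6, 3, 4
W = rng.normal(size=(n, N))     # stand-in for the learned down-projection W^LF
keys = rng.normal(size=(N, d))

K_t = np.zeros((n, d))
for t in range(N):
    # Recurrent update with phi_t = W[:, t].
    K_t = K_t + np.outer(W[:, t], keys[t])
    # At each step, K_t matches projecting only the prefix keys.
    assert np.allclose(K_t, W[:, :t + 1] @ keys[:t + 1])
```

In other words, the "cannot be causal" concern only applies to the batch formulation; the ABC recurrence sidesteps it.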

Clustering-Based Attention
Improving attention's efficiency with clustering has received an increasing amount of interest (Kitaev et al., 2020; Roy et al., 2020; Wang et al., 2020a, inter alia). ABC bears interesting connections to clustering-based methods. Here we discuss an approach that closely follows Vyas et al. (2020), except that it clusters keys and values instead of queries, and only attends to the centroids to reduce the effective context size. Formally, keys and values are grouped into n < N clusters.[3] Let an N-by-n binary matrix M denote the cluster membership shared between keys and values: M_{i,j} = 1 iff k_i is assigned to cluster k^CL_j and v_i to v^CL_j. The jth centroid for the keys is

    k^CL_j = (Σ_{i=1}^N M_{i,j} k_i) / (Σ_{i=1}^N M_{i,j}),

and likewise for the values. The model then attends over the centroids using Eq. 4, with

    K^CL = [k^CL_1, ..., k^CL_n]^⊤ = Σ_{i=1}^N ϕ^CL_i ⊗ k_i,   [ϕ^CL_i]_j = M_{i,j} / Σ_{i'=1}^N M_{i',j}.

The last line indicates that this model is an instance of ABC: the stack of centroids can be seen as the constant-size memory. Putting aside the clustering overhead (i.e., constructing M and computing centroids), it has a linear complexity in the sequence length.

[3] We use k^CL_j to denote both the jth cluster and its centroid.
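The centroid identity can be sketched numerically (our illustration, with a random hard assignment): dividing each row of the binary membership matrix M by its cluster's size yields ABC control vectors whose outer-product sum is exactly the stack of cluster means.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, d = 7, 3, 4
keys = rng.normal(size=(N, d))
assign = rng.integers(0, n, size=N)
assign[:n] = np.arange(n)           # ensure no cluster is empty
M = np.eye(n)[assign]               # N x n one-hot membership matrix

# Cluster means, computed directly.
centroids = (M.T @ keys) / M.sum(axis=0, keepdims=True).T

# ABC view: [phi_i]_j = M[i, j] / (size of cluster j), context-agnostic.
phis = M / M.sum(axis=0, keepdims=True)
K_abc = np.einsum('in,id->nd', phis, keys)
assert np.allclose(K_abc, centroids)
```

The clustering step itself (building M) sits outside the ABC abstraction; only the memory construction is captured here.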

Sliding-Window Attention
In some applications, being able to remove entries from the memory can be beneficial: clearing up older context frees slots for more recent ones, promoting a locality inductive bias. ABC offers the capability to do so, if augmented with an additional matrix multiplication. We use the sliding-window attention as an example.
Attending to the most recent n input tokens (Beltagy et al., 2020; Zaheer et al., 2020; Sukhbaatar et al., 2021, inter alia) can be seen as a first-in-first-out queue that "pops" out the oldest token while "pushing" in the most recent one. The pop operation can be achieved by multiplying an n-by-n upper shift matrix U, with U_{i,j} = δ_{i+1,j}, where δ is the Kronecker delta (i.e., U has ones only on the superdiagonal and zeros elsewhere). Left-multiplying U against K^WD_t shifts its rows one position up, with zeros appearing in the last row. Then the most recent token can be put into the slot freed up:

    K^WD_{t+1} = U K^WD_t + e_n ⊗ k_{t+1}.

Together, U and ϕ_t = e_n implement a first-in-first-out queue. Dilated and stride convolution patterns (Beltagy et al., 2020) can be similarly recovered (§A.4).
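The queue behavior can be sketched as follows (our illustration): the upper shift matrix U pops the oldest row, and adding e_n ⊗ k_{t+1} pushes the newest key into the freed last slot, so after enough steps the memory holds exactly the last n keys.

```python
import numpy as np

n, d = 4, 3
U = np.eye(n, k=1)                  # ones on the superdiagonal: U[i, j] = delta_{i+1, j}
rng = np.random.default_rng(4)
keys = rng.normal(size=(10, d))

K = np.zeros((n, d))
e_n = np.zeros(n)
e_n[-1] = 1.0
for t in range(10):
    # Pop the oldest key (shift rows up), then push the newest into slot n.
    K = U @ K + np.outer(e_n, keys[t])

# The memory now holds exactly the last n keys, oldest first.
assert np.allclose(K, keys[-n:])
```

Note that U here is fixed structure, not a parameter; it simulates the discrete pop in a differentiable computation graph.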
Recurrently multiplying U simulates the discrete pop operation (Grefenstette et al., 2015; Joulin and Mikolov, 2015; Yogatama et al., 2018) in a differentiable way. This is reminiscent of recurrent neural networks, though here U is fixed and never updated as a parameter. It would be exciting to explore learning U, but that is beyond the scope of this work.
Discussion. Besides the models discussed above, certain variants of Rae et al. (2020) and sparse attention patterns (local-to-global attention; Beltagy et al., 2020; Zaheer et al., 2020; Ainslie et al., 2020) can also be seen as instances of ABC (§A). ABC provides a unified perspective of them, and at the same time points out their limitation: their control strategies are context-agnostic. In response, in §4 we propose to learn a contextualized strategy from data. Table 1 analyzes various ABC models, and Table 2 details their complexity.

Learned Memory Control
The ABC abstraction connects several existing approaches that would otherwise seem distinct, and inspires the design of new architectures. We hypothesize that learning a contextualized strategy can achieve better performance. This section introduces ABC MLP , which parameterizes ϕ with a single-layer multi-layer perceptron (MLP) that takes as input the token's representation x_i and determines which slots to write it into and how much.
ABC MLP computes the control vectors as

    α_i = exp(W_ϕ x_i),   ϕ_i = α_i / Σ_{j=1}^N α_j,

where the matrix W_ϕ is learned and exp is an elementwise activation function. The motivation is to allow for storing a "fractional" (but never negative) amount of input into the memory.[4] Using a non-negative activation, however, has a drawback: without normalization, the scales of Σ_i ϕ_i ⊗ k_i and Σ_i ϕ_i ⊗ v_i would grow with the sequence length, making training less stable. To overcome this, we divide the α_i vectors by their sum. This functions as normalization and aims to offset the impact of varying sequence lengths.[5] The model admits the recurrent computation graph of Eq. 5, and has a linear complexity in the sequence length.
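A minimal sketch of this control function (our own; W_ϕ is random here rather than learned, and the normalization sums over the whole sequence, as in encoder self-attention or cross attention):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, d = 6, 3, 4
X = rng.normal(size=(N, d))         # token representations x_i
keys = rng.normal(size=(N, d))
W_phi = rng.normal(size=(n, d))     # stand-in for the learned W_phi

# alpha_i = exp(W_phi x_i); non-negative "fractional" write amounts.
alphas = np.exp(X @ W_phi.T)        # N x n
# Normalize over the sequence so memory scale does not grow with length.
phis = alphas / alphas.sum(axis=0, keepdims=True)

# Bounded memory: K~ = sum_i phi_i (x) k_i.
K_mem = np.einsum('in,id->nd', phis, keys)
assert np.allclose(phis.sum(axis=0), np.ones(n))
assert K_mem.shape == (n, d)
```

For causal attention, the same computation normalizes over the prefix instead, which still only needs past tokens.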
A key design choice of ABC MLP is that its ϕ_i depends only on the current input x_i. This helps (1) keep the recurrent computation efficient in practice (Lei et al., 2018), and (2) make it applicable in not only encoder self-attention and cross attention, but also causal attention. Concurrently to this work, Goyal et al. (2021) and Ma et al. (2021) also proposed methods to learn contextualized control. They compute ϕ_i from the previous layer's memory, revealing the full sequence to the control vectors. As a result, these two approaches are unsuitable for causal attention.[6] ABC MLP , like other ABC models, can be used as a drop-in replacement for the canonical softmax attention, and we apply its multihead variant in transformers. With proper parameter sharing, the number of additional parameters ABC MLP incurs is small: inspired by Wang et al. (2020b), we tie the ϕ-MLP's parameters across layers, which adds less than 1% parameters to the models.

[4] We experiment with other activations in §C.2.
[5] Here encoder self-attention or cross attention is assumed, and the normalization sums over the entire sequence. Causal attention is slightly different, normalizing by the sum over the prefix instead: ϕ_i = α_i / Σ_{j=1}^i α_j. This does not require access to future tokens. §B.1 details a linear-complexity computation graph for causal ϕ_i.

ABC MLP : context-agnostic then context-dependent attention. We now dissect ABC MLP and show that it can be seen as a cascade of two attention mechanisms: one with a learned, context-agnostic "pseudo-query," followed by one with a context-dependent query. Our analysis starts with a one-dimensional example; the conclusion generalizes to higher-dimensional cases.
Example 1. Consider ABC MLP with a single memory slot (n = 1). It is parameterized with a learned vector w_ϕ, and

    ϕ_i = exp(w_ϕ · x_i) / Σ_{j=1}^N exp(w_ϕ · x_j),   K~ = Σ_{i=1}^N ϕ_i k_i.

In other words, K~ uses w_ϕ as a "pseudo-query" to attend to {x_i}, retrieving the corresponding {k_i}. Despite its similarity to the standard softmax attention, Example 1 has a more efficient linear complexity in sequence length; w_ϕ's being context-independent is the key to the savings. Table 2 details its complexity.
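Example 1 can be checked directly (our sketch; w_ϕ is random here): with n = 1, the single memory slot is exactly a softmax-weighted sum of the keys, where the attention weights come from scoring {x_i} against the fixed pseudo-query w_ϕ.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 5, 4
X = rng.normal(size=(N, d))         # token representations x_i
keys = rng.normal(size=(N, d))
w_phi = rng.normal(size=d)          # stand-in for the learned pseudo-query

# Softmax over the sequence with the context-independent pseudo-query.
scores = X @ w_phi
weights = np.exp(scores) / np.exp(scores).sum()

# ABC_MLP memory with n = 1: a single d-dimensional slot.
K_slot = (weights[:, None] * keys).sum(axis=0)
assert np.allclose(K_slot, keys.T @ weights)   # same as a softmax read
```

The saving relative to standard attention is that this softmax is computed once per slot, not once per query.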
Example 1's conclusion generalizes to higher-dimensional cases: the jth dimension of {ϕ_i} attends to {x_i} and {k_i} using the jth row of W_ϕ as the context-independent pseudo-query; n such attention mechanisms run in parallel, stacking the results into the n-by-d memories K~ and V~. Intuitively, it is the "real queries" {q_i} that encode what information is useful for the prediction task. Without access to them, ABC MLP summarizes the input n times using different pseudo-queries, aiming to preserve enough information in the memory for onward computation. The attention output is then calculated with the context-dependent real queries using Eq. 4. §B.2 presents a detailed derivation.

Table 2: ABC's time and space complexity in sequence length against the softmax attention's. "Mem." indicates the time and space needed for calculating and storing the memory K~, V~. N denotes the sequence length, and n the memory size. The time complexity analysis assumes that the softmax attention cannot be parallelized across the queries; in practice, this is common in autoregressive decoding, or for long sequences where accelerators (e.g., GPUs) do not have enough threads to fully parallelize the computation across different queries.
Connections to other prior works. Although starting from distinct motivations, ABC MLP closely relates to hierarchical attention (HA; Yang et al., 2016). HA summarizes the context into higher-level representations with a cascade of attention mechanisms, e.g., words to sentences, and then to documents. ABC MLP applies two types of attention.
The first learns context-agnostic pseudo-queries and attends to the same sequence n times in parallel, while the second retrieves from the memory with real queries. HA, in contrast, summarizes non-overlapping segments at each level. The learned pseudo-queries also closely relate to the inducing point method in set attention (ISA; Lee et al., 2019). ISA applies a non-linear feedforward network between a cascade of two attention modules. This precludes the outer-product memory computation and efficient recurrences in ABC.
Another line of work "linearizes" attention through kernel tricks and also applies bounded memory: their feature map dimensions are analogous to memory sizes. They substitute the softmax with approximations (Peng et al., 2021; Choromanski et al., 2021), or with heuristically designed (Katharopoulos et al., 2020; Schlag et al., 2021) or learned (Kasai et al., 2021b) functions. ABC MLP keeps the softmax, but over a smaller, constant-sized context. This can be useful in practice: (1) ABC provides a unified perspective of several efficient attention methods, allowing for borrowing from existing wisdom to design new architectures; (2) it draws a close analogy to the canonical softmax attention, and is better-suited as its drop-in substitute in various application settings, as we will show in the experiments; (3) empirically, we find that ABC MLP can get away with a much smaller memory size to retain accuracy. Peng et al. (2021) and Schlag et al. (2021) use gating to promote a recency bias. The same technique is equally applicable to ABC models.
The learned contextualized memory control is reminiscent of the content-based addressing in neu-ral Turing machines (NTM; Graves et al., 2014). ABC MLP computes the control vectors {ϕ i } as a function of the input, but not of the memory as in NTM. This ensures that the control vectors at different timesteps can be computed in parallel, improving the time efficiency in practice (Lei et al., 2018;Peng et al., 2018). Analogies between memory and neural architectures are also made by other previous works (Hochreiter and Schmidhuber, 1997;Weston et al., 2015;Le et al., 2020, inter alia).

Experiments
We evaluate ABC models on language modeling ( §5.1), sentence-level and document-level machine translation ( §5.2), and masked language model finetuning ( §5.3). Dataset statistics and implementation details are summarized in §C.

Language Modeling
Setting. We experiment with WikiText-103, sampled text from English Wikipedia (Merity et al., 2017). The BASE model with standard softmax attention is the strong transformer-based language model by Baevski and Auli (2019). We compare the following ABC variants, which build on BASE but replace the softmax attention with linear-complexity bounded-memory attention alternatives, keeping other components the same.
• ABC MLP , as described in §4, learns a contextualized exp-MLP as the ϕ function.
• Linformer (§3.1; Wang et al., 2020b).
• ABC RD stores each token in a randomly-selected memory slot, with ϕ_t = e_{i_t}, where i_t is uniformly drawn from {1, ..., n} at each time step. This helps us quantify the differences between random and learned bounded-memory controls.
We consider two model size settings: 16 layers (Baevski and Auli, 2019) and 32 layers (Kasai et al., 2021b).

Results. Among ABC models, ABC MLP achieves the best performance for both context sizes. With a memory size n = 64, ABC MLP outperforms both Linformer and ABC RD by more than 2.9 test perplexity, and the gap is larger with the longer 480-length context: more than 3.6 test perplexity. ABC MLP -32 outperforms its larger-memory ABC counterparts by more than 2.1 test perplexity. These results confirm ABC MLP 's advantage of using a contextualized strategy. Surprisingly, Linformer underperforms ABC RD , and its performance drops with the larger 480-length context window. This suggests that, while successful in text encoding, Linformer's position-based strategy is a suboptimal design choice for causal attention, at least for long context. All ABC models underperform BASE, with ABC MLP -64 having the smallest gap of 0.5 perplexity. ABC MLP -32 also outperforms kernel-based methods by more than 0.9 test perplexity in Kasai et al. (2021b)'s 32-layer setting (Table 3b).

Machine Translation
Datasets. To assess performance over various output lengths, we compare ABC models on sentence- and document-level machine translation.

Results. Table 4a summarizes sentence-level machine translation results on the WMT14 EN-DE test set. Overall, ABC MLP performs on par with BASE, with either 32-32 or 32-8 cross-causal memory sizes. Even with smaller memory sizes, it outperforms other ABC variants by more than 1.1 BLEU. Differently from the trend in the language modeling experiment (§5.1), Linformer outperforms ABC RD by more than 0.5 BLEU; we attribute this to the smaller sequence lengths of this dataset. ABC MLP outperforms other ABC models by more than 0.4 BLEU, even with smaller memory sizes.

The trend is similar on document-level translation with IWSLT14 ES-EN (Table 4b), except that ABC MLP slightly underperforms BASE by 0.2 BLEU. This suggests that even with longer sequences, ABC MLP is effective despite its bounded memory size. Linformer fails to converge even with multiple random seeds, suggesting the limitations of its purely position-based strategy in tasks involving decoding varying-length text.

Masked Language Model Finetuning
Setting. We compare the same ABC variants as in §5.1. It would be interesting to pretrain ABC from scratch, but we lack the resources to do so. Instead, we warm-start from a pretrained RoBERTa-base (Liu et al., 2019) trained with the softmax transformer, swap its attention with ABC variants, and continue pretraining with the masked language modeling (MLM) objective on a concatenation of BookCorpus (Zhu et al., 2015), English Wikipedia, OpenWebText (Gokaslan and Cohen, 2019), and RealNews (Zellers et al., 2019).[7] The models are then finetuned and evaluated on downstream classification datasets from the GLUE benchmark (Wang et al., 2019). This is an appealing setting, since it avoids reinvesting the huge amounts of resources already put into pretraining.[8]

Results. Table 5 compares downstream text classification performance. BASE indicates a baseline that continues pretraining RoBERTa-base on our data.[9] Following standard practice, we report development accuracy. Linformer achieves competitive performance, aligned with Wang et al. (2020b)'s results. ABC MLP outperforms Linformer, and performs on par with or better than BASE, affirming the benefits of contextualized memory organization in MLM.

[7] Our data differs from RoBERTa's, which we do not have access to. We replace CC-News (Nagel, 2016) with RealNews, and drop Stories (Trinh and Le, 2018), whose public access is broken at the time of this work.
[8] In preliminary experiments, we explored swapping in ABC and then directly finetuning on downstream tasks without continued MLM pretraining; all models fail.
[9] BASE slightly underperforms RoBERTa-base. This could be due to overfitting, or the pretraining data discrepancy.

Table 5: Text classification development set accuracy. All models continue pretraining RoBERTa-base on our data with the MLM objective. Bold numbers perform best among ABC models, and underlined ones perform on par with or better than BASE.
ABC RD fails to converge in continued pretraining even with multiple seeds.
Based on the above results, we think ABC MLP can achieve competitive performance when pretrained from scratch, just as Linformer does (Wang et al., 2020b). Further empirical exploration is beyond our budget and left for future work.

Analysis
Decoding efficiency over varying sequence lengths. ABC's efficiency gains can be more prominent for long sequences. We study ABC MLP 's decoding overhead with varying sequence lengths. Following Kasai et al. (2021b), we consider a sequence-to-sequence generation experiment. Three linear-complexity models are compared: RFA (with 256/128 cross/causal memory sizes; Peng et al., 2021), T2R (32/4; Kasai et al., 2021b), and ABC MLP (32/8). The sizes are chosen to maximize efficiency without an accuracy drop. T2R needs to be finetuned from a pretrained transformer to match this performance, while the others do not.
All linear-time models achieve consistent decoding speed for different lengths (Figure 1a), substantially outpacing the softmax attention baseline, especially for long sequences. In particular, ABC MLP decodes ∼1.25 times faster than RFA, another competitive model that can match the transformer's accuracy without a warm start from a pretrained model. This can be attributed to the fact that ABC MLP achieves similar accuracy with a much smaller memory. T2R's memory sizes are similar to ABC MLP 's, but it decodes about 20% faster; this is because it does not compute the softmax when calculating the attention output, while ABC MLP does (Eq. 4). These results show that ABC MLP is an appealing modeling choice for decoding tasks, especially when training from scratch is desired.

Figure 1: Sequence-to-sequence decoding speed (a) and memory consumption (b) over varying sequence lengths. Greedy decoding is used, with batch size 16.
ABC MLP also achieves significant savings in terms of memory overhead (Figure 1b). ABC MLP , RFA, and T2R's curves are similar.
Text encoding efficiency. We compare the efficiency of ABC MLP against softmax attention and Linformer when used as text encoders. The models' sizes mirror those in the MLM experiment (§5.3). Table 6 summarizes inference time and memory overhead with 512-length inputs and batch size 16; inference speed is measured on the same V100 GPU. Both ABC MLP and Linformer achieve inference speed gains and memory savings over BASE. Linformer is faster, since its linear projection is cheaper to compute than ABC MLP 's MLP; the trend in memory overhead is similar.
Although ABC MLP slightly underperforms Linformer in inference speed, it can be a more appealing architectural choice in practice: in all five of our experiments, ABC MLP outperforms other ABC models in accuracy, whereas Linformer fails to converge or yields suboptimal performance on some tasks. This confirms ABC MLP 's flexibility and applicability in various settings.
Memory size's impact on accuracy. Practically, one may want to minimize the memory size to improve efficiency. We use the WMT14 EN-DE experiment to investigate how memory size affects accuracy. Using §5.2's setup, we vary ABC MLP 's cross and causal attention memory sizes and compare translation quality on the development data. The sizes are selected from {8, 16, 32, 64}, with cross attention's equal to or larger than causal's, since cross attention is more important than causal attention in machine translation (Michel et al., 2019). Our results (Table 7) align with this observation: when the cross attention memory is large enough, reducing the causal attention memory size from 64 to 8 incurs only a minor 0.3 BLEU drop. Surprisingly, ABC MLP with 8-8 sized cross-causal memory is only 1.1 BLEU behind the best-performing configuration.

Conclusion
We presented attention with bounded-memory control (ABC). It provides a unified perspective of several recently-proposed models, and shows that they vary in the organization of the bounded memory. ABC reveals new insights into established methods and inspires new architectures. We proposed ABC MLP , a particular instance of ABC that learns a contextualized memory control. On language modeling, machine translation, and masked language model finetuning, ABC MLP outperforms previous ABC models. Compared to the strong transformer baseline, ABC MLP achieves substantial efficiency improvements with no or negligible accuracy loss.

A.1 Sparse Local-to-global Attention
Local-to-global attention sparsifies the attention pattern to reduce the number of tokens that are attended to (Beltagy et al., 2020; Zaheer et al., 2020, inter alia). All queries attend to a subset of n < N "global tokens," ignoring the others; the effective context size is therefore reduced to n. The global tokens are usually preselected by position according to some heuristic. Local-to-global attention is an instance of ABC: it is recovered by letting ϕ_t = e_i if x_t is the ith global token (i = 1, . . . , n), and the zero vector otherwise.
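This construction can be checked with a small NumPy sketch (the slot count, dimensions, and global-token positions below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 8, 3, 4                    # sequence length, memory slots, head dim
K = rng.normal(size=(N, d))          # key vectors, one per token
global_idx = [0, 3, 6]               # hypothetical pre-selected global tokens

# ABC control matrix: phi_t = e_i if x_t is the i-th global token, else 0
Phi = np.zeros((n, N))
for i, t in enumerate(global_idx):
    Phi[i, t] = 1.0

K_mem = Phi @ K                      # bounded memory: n x d

# The memory holds exactly the global tokens' keys
assert np.allclose(K_mem, K[global_idx])
```

The one-hot columns of Phi write each global token into its own slot, while the all-zero columns drop the remaining tokens, reproducing the sparse pattern.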

A.2 Random Memory Control
As a baseline, ABC RD stores each token in a randomly selected memory slot: ϕ_t = e_{i_t}, where i_t is uniformly drawn from {1, . . . , n} for each t. It is designed as a baseline to ABC MLP and Linformer, to quantify the difference between random and learned bounded-memory control. Random sparse attention patterns are explored by Zaheer et al. (2020), where a subset of n < N tokens is randomly selected to be attended to by all tokens. ABC RD is different: it attends to all tokens, but randomly "squashes" them into an n-slot memory.
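A minimal sketch of ABC RD's control, again with illustrative sizes; the final assertion highlights that no token is dropped, unlike random sparse attention:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 10, 4, 3                   # sequence length, memory slots, head dim
K = rng.normal(size=(N, d))

# ABC_RD: each token t is written to a uniformly random slot i_t
slots = rng.integers(0, n, size=N)   # i_t for t = 1..N
Phi = np.zeros((n, N))
Phi[slots, np.arange(N)] = 1.0       # phi_t = e_{i_t}, one per column

K_mem = Phi @ K

# Every token contributes to exactly one slot: summing the memory's rows
# recovers the sum over all N keys (nothing is ignored, whereas random
# sparse attention discards the unselected tokens entirely).
assert np.allclose(K_mem.sum(axis=0), K.sum(axis=0))
```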

A.3 Compressive Transformer with Mean Pooling
The compressive transformer (Rae et al., 2020) explores various ways to "squash" long context into smaller and more compact representations, and achieves state-of-the-art performance on several language modeling benchmarks. We show that at least its mean-pooling variant can be seen as an ABC instance. This variant compresses the context by averaging every c consecutive tokens into one memory slot:

[K̃]_i = (1/c) Σ_{j=(i−1)c+1}^{ic} k_j,  i = 1, . . . , n,

where c = N/n is the compression ratio. Here N mod n = 0 is assumed; otherwise the sequence can be padded. The above model is an ABC instance, obtained by letting ϕ_t = (1/c) e_{⌈t/c⌉}.
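The equivalence can be verified numerically; the sizes below are illustrative, assuming N mod n = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 12, 3, 4                   # N mod n == 0, compression ratio c = 4
c = N // n
K = rng.normal(size=(N, d))

# Compressive transformer (mean-pooling variant): average each block of c keys
K_pool = K.reshape(n, c, d).mean(axis=1)

# The same memory as an ABC instance: phi_t = (1/c) * e_{ceil(t/c)}
Phi = np.zeros((n, N))
for t in range(N):                   # 0-based t; slot index is t // c
    Phi[t // c, t] = 1.0 / c
K_mem = Phi @ K

assert np.allclose(K_pool, K_mem)
```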

A.4 Dilated Convolution Attention Patterns
The dilated attention pattern is similar to sliding-window attention in that it only considers the context within a predefined window; it differs in that it attends to every other token within that window. It can be simulated with two separate queues, K_odd and K_even, holding the keys at odd and even positions respectively (likewise for the values). Depending on t, the query attends to one of the two queues:

output = V_odd^⊤ softmax(K_odd q_t) if t is odd, and V_even^⊤ softmax(K_even q_t) otherwise.
This implementation could incur a considerable amount of overhead and may actually be more expensive than the original dilated-window formulation; it therefore has more conceptual than practical value.
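A sketch of the two-queue simulation under the parity convention above (window truncation is omitted for brevity; all sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 8, 4
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
q = rng.normal(size=d)

# Two queues holding the odd- and even-position keys/values (1-based positions)
odd = np.arange(N) % 2 == 0          # 0-based indices 0, 2, ... = positions 1, 3, ...
K_odd, V_odd = K[odd], V[odd]
K_even, V_even = K[~odd], V[~odd]

t = 5                                # 1-based query position (odd)
if t % 2 == 1:
    out = V_odd.T @ softmax(K_odd @ q)
else:
    out = V_even.T @ softmax(K_even @ q)

# Same result as attending directly over the dilated (same-parity) positions
mask = odd if t % 2 == 1 else ~odd
ref = V[mask].T @ softmax(K[mask] @ q)
assert np.allclose(out, ref)
```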

A.5 Shared Workspace and Linear Unified Nested Attention
Concurrently to this work, shared workspace (SW; Goyal et al., 2021) and linear unified nested attention (LUNA; Ma et al., 2021) also proposed methods to learn contextualized memory control strategies. Both can be seen as instances of ABC. At layer ℓ, their ϕ_i^ℓ is a function of the previous layer's memory X̃^{ℓ−1} ∈ R^{n×d} and the current layer's input X^ℓ ∈ R^{N×d}:

ϕ_i^ℓ = softmax(X^ℓ [X̃^{ℓ−1,⊤}]_{:,i}),

where [·]_{:,i} denotes the ith column of a matrix. Query, key, and value projections are suppressed for notational clarity.
SW and LUNA reveal the entire sequence to the control vectors by constructing ϕ as a function of the previous layer's memory. Although both admit the recurrent computation that all ABC models do, they are ill-suited for causal attention and autoregressive decoding, since future information "leaks" into ϕ_i through the previous layer. LUNA resorts to a variant of Katharopoulos et al. (2020) for causal attention (Ma et al., 2021). In contrast, ABC MLP never conditions ϕ_i on the previous layer's memory, only on the current layer's input.
B More Details about ABC-MLP

B.1 Normalization in Causal Attention
An equivalent implementation to Eq. 7 is to normalize K̃ and Ṽ instead of the ϕ_i vectors:

K̃_t = (Σ_{j≤t} ϕ̄_j ⊗ k_j) / (Σ_{j≤t} ϕ̄_j),

where ϕ̄_j denotes the unnormalized control vector, and M/z divides the ℓth row of matrix M by vector z's ℓth dimension. Both the numerator and the denominator can be maintained as running prefix sums; this admits a linear-complexity computation graph for the causal variant of ABC MLP.
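A sketch of this equivalence, using a single linear layer exp(W_ϕ x) as a stand-in for Eq. 7's MLP (W_ϕ and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 6, 3, 4
X = rng.normal(size=(N, d))          # token representations
K = rng.normal(size=(N, d))          # keys
W_phi = rng.normal(size=(n, d))      # hypothetical stand-in for Eq. 7's MLP

phi_bar = np.exp(X @ W_phi.T)        # unnormalized control vectors, N x n

# Direct route: normalize phi over the prefix, then build the memory
t = N                                # memory after reading all N tokens
z = phi_bar[:t].sum(axis=0)          # n-dim prefix normalizer
phi = phi_bar[:t] / z                # normalized control vectors
K_mem_a = phi.T @ K[:t]

# Equivalent route: accumulate the unnormalized memory M and z as running
# prefix sums, then divide M's l-th row by z's l-th entry -- linear in N
M = np.zeros((n, d))
z_run = np.zeros(n)
for j in range(t):
    M += np.outer(phi_bar[j], K[j])
    z_run += phi_bar[j]
K_mem_b = M / z_run[:, None]

assert np.allclose(K_mem_a, K_mem_b)
```

Because the normalizer z does not depend on the summation index j, it can be factored out of the sum, which is what makes the prefix-sum formulation exact rather than approximate.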

B.2 Higher-Dimensional Case of Example 1
This section generalizes Example 1 to higher-dimensional cases. Assume that the constant-sized memory has n slots, and that ϕ_i is calculated as in Eq. 7. Then K̃ = Σ_{i=1}^N ϕ_i ⊗ k_i ∈ R^{n×d}. Each row of K̃ can be seen as a separate attention mechanism with a pseudo-query. Let [·]_ℓ denote the ℓth row/dimension of a matrix/vector. Then for any ℓ = 1, . . . , n,

[K̃]_ℓ = Σ_{i=1}^N [ϕ_i]_ℓ k_i = Σ_{i=1}^N (exp([W_ϕ]_ℓ · x_i) / Σ_{j=1}^N exp([W_ϕ]_ℓ · x_j)) k_i.

In other words, there are n attention mechanisms in total, each with a separately parameterized pseudo-query [W_ϕ]_ℓ. They summarize the context n times in parallel, each producing a d-dimensional vector. These output vectors are then stacked into the n-by-d memory K̃; Ṽ is constructed analogously.
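The pseudo-query view can be verified numerically; as above, a single linear layer stands in for Eq. 7's MLP, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 6, 3, 4
X = rng.normal(size=(N, d))          # token representations
K = rng.normal(size=(N, d))          # keys
W_phi = rng.normal(size=(n, d))      # hypothetical stand-in for Eq. 7's MLP

# Memory as in the text: K_mem = sum_i phi_i (outer) k_i, with phi normalized
phi_bar = np.exp(X @ W_phi.T)        # N x n
phi = phi_bar / phi_bar.sum(axis=0)  # normalize each slot over the tokens
K_mem = phi.T @ K                    # n x d

# Row l equals softmax attention over the keys with pseudo-query W_phi[l]
for l in range(n):
    scores = X @ W_phi[l]
    attn = np.exp(scores) / np.exp(scores).sum()
    assert np.allclose(K_mem[l], attn @ K)
```

Each row of K_mem is thus an ordinary attention summary of the context, except that the query is a learned parameter rather than a function of a token.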

C.1 Language Modeling
We closely build on Baevski and Auli (2019) and Kasai et al. (2021b). The hyperparameters are summarized in Table 10. All models are trained on 4 A100 GPUs.

C.2 Machine Translation
We experiment with a sentence-level translation task. We average the checkpoints from the last five epochs to obtain the final model (Vaswani et al., 2017). At inference, we apply beam search with size 5 and length penalty 0.6. Other hyperparameters are summarized in Table 11. All models are trained on 4 RTX 2080 Ti GPUs.
Additional machine translation results. In addition to the results presented in §5.2, Table 8 further compares, on the WMT14 EN-DE dataset, the clustering-based ( §3.2) and sliding-window ( §3.3) models of ABC, as well as ReLU and sigmoid variants of ABC MLP . Clustering and sliding-window ABC variants underperform ABC MLP with the same memory sizes by more than 0.5 BLEU. Both ReLU and sigmoid underperform their exp counterpart.
MLP-exp-all replaces the encoder's softmax attention modules with ABC, in addition to the decoder's. It underperforms ABC MLP by only 0.3 BLEU.

Following Kasai et al. (2021b), we consider a synthetic sequence-to-sequence generation task with varying sequence lengths. A batch size of 16 and greedy decoding are used. The models are the same size as those in §5.2.

C.3 Masked Language Model Finetuning
Our data for continued pretraining is a concatenation of BookCorpus (Zhu et al., 2015), English Wikipedia, OpenWebText (Gokaslan and Cohen, 2019), and RealNews (Zellers et al., 2019). It differs from RoBERTa's pretraining data, to which we do not have access: we replace its CC-News (Nagel, 2016) with RealNews, and drop Stories (Trinh and Le, 2018), whose public access was broken at the time of this project. Our machine does not have a large enough memory to load all the data, so we shuffle the training data and then split it into 20 shards. Other preprocessing follows Liu et al. (2019). The hyperparameters for continued pretraining follow base-sized RoBERTa, part of which are summarized in