Modeling Context With Linear Attention for Scalable Document-Level Translation

Document-level machine translation leverages inter-sentence dependencies to produce more coherent and consistent translations. However, these models, predominantly based on transformers, are difficult to scale to long documents as their attention layers have quadratic complexity in the sequence length. Recent efforts on efficient attention improve scalability, but their effect on document translation remains unexplored. In this work, we investigate the efficacy of a recent linear attention model by Peng et al. (2021) on document translation and augment it with a sentential gate to promote a recency inductive bias. We evaluate the model on IWSLT 2015 and OpenSubtitles 2018 against the transformer, demonstrating substantially increased decoding speed on long sequences with similar or better BLEU scores. We show that sentential gating further improves translation quality on IWSLT.


Introduction
Sentence-level neural machine translation has seen significant recent progress (Bahdanau et al., 2015; Vaswani et al., 2017). A move to document-level translation opens the possibility of using inter-sentential context at the scale of paragraphs, documents, or even whole books (Lopes et al., 2020; Ma et al., 2021b; Maruf et al., 2021). This opens up new research avenues to improve translation and its evaluation through more consistent anaphora resolution and discourse coherence (Bawden et al., 2018; Müller et al., 2018; Voita et al., 2019).
Transformers have enabled state-of-the-art results for sentence-level translation (Vaswani et al., 2017; Chen et al., 2018; Wang et al., 2019), and this success has made them the default architecture for document translation. However, they do not scale well in the sequence length due to the quadratic complexity of attention and hence are computationally prohibitive to apply to long text. Alternative architectures exist, but most still have quadratic complexity (Zhang et al., 2018; Voita et al., 2019) and/or extra modules that further add to the inference cost (Tu et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Donato et al., 2021).
By reducing asymptotic complexity, recent work on efficient attention may pave the way for long text generation. However, these methods' suitability for document translation requires further investigation: some do not focus on decoding speed (Guo et al., 2019; Child et al., 2019; Kitaev et al., 2020; Wang et al., 2020, i.a.), while for others the speedup and quality impact on document translation remain unknown (Kasai et al., 2021; Schlag et al., 2021; Ma et al., 2021a; Choromanski et al., 2021, i.a.). In this work, we consider random feature attention (RFA; Peng et al., 2021), a representative model with established accuracy and efficiency in sentence-level translation. With few additional parameters, it approximates softmax attention in linear time and space using recurrent computations. We explore its effectiveness for document translation and find a substantial decoding speedup over a transformer with similar or improved BLEU. We also equip RFA with a sentential gate, injecting a recency inductive bias tailored to representing document context.
Our main contributions are: (i) we study the efficacy of RFA for document translation; (ii) we validate that RFA is competitive with a transformer but substantially faster on long documents; (iii) we augment RFA with a sentential gate designed to promote a recency bias, which brings a 0.5 BLEU improvement on IWSLT (Cettolo et al., 2015). To encourage future research, we release our code.

Background

Document-Level Machine Translation. Leveraging inter-sentence context enables more coherent translation, improving lexical choice and ambiguity resolution (Voita et al., 2019).
The Concatenation Model. Recent studies found that the concatenation model, which directly translates the source document to the target document (or a multi-sentence window) with a single encoder-decoder model, performs well (Tiedemann and Scherrer, 2017; Ma et al., 2021b), especially on large datasets (Junczys-Dowmunt, 2019). Figure 1 illustrates this model combined with sliding-window decoding, which we adopt in this work.

Scalability of Attention. In machine translation, transformers have three types of attention: encoder self-attention, cross attention, and causal attention. In each, every query q_t is dotted with all keys {k_i} to obtain the attention weights, with which a weighted average of the values {v_i} is calculated:

$$\mathrm{attn}\left(q_t, \{k_i\}, \{v_i\}\right) = \sum_{i=1}^{N} \frac{\exp(q_t \cdot k_i)}{\sum_{j=1}^{N} \exp(q_t \cdot k_j)}\, v_i,$$

where N is the sequence length. This pairwise interaction incurs quadratic overhead in N, which is inefficient for the long text sequences in the concatenation model. It particularly impacts cross and causal attention at decoding time, which cannot be parallelized (Kasai et al., 2021).
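To make the quadratic cost concrete, here is a minimal NumPy sketch of this computation (illustrative only; it omits the scaling and masking used in practice). Materializing the N × N score matrix is exactly the overhead that grows quadratically with the window size:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla softmax attention. Q, K: (N, d); V: (N, d_v).
    Materializing the (N, N) score matrix makes time and
    memory quadratic in the sequence length N."""
    scores = Q @ K.T  # (N, N): every query dotted with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (N, d_v): weighted average of the values
```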

Scalable Document-Level Translation
We test RFA as a linear time and space model to improve the efficiency of document translation. We also augment it with a sentential gate to circumvent the capacity constraints of long contexts by injecting a recency inductive bias.

Figure 2: Our sentential gating mechanism. e_1 and e_4 are at the beginnings of two sentences.

Random Feature Attention
RFA approximates the softmax attention attn(q_t, {k_i}, {v_i}) in linear time and space:

$$\mathrm{attn}\left(q_t, \{k_i\}, \{v_i\}\right) \approx \frac{\phi(q_t)^\top S}{\phi(q_t) \cdot z}, \quad S = \sum_{i=1}^{N} \phi(k_i)\, v_i^\top, \quad z = \sum_{i=1}^{N} \phi(k_i).$$

Here ϕ(·) is a randomized nonlinear transformation (Rahimi and Recht, 2008), and S and z summarize the keys and values. We use RFA in cross and causal attention, which are the most impactful for speed and memory, so q_t is always from the target sequence.
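As an illustration, the following is a minimal NumPy sketch of the feature map and of RFA used as cross attention (a simplification: Peng et al. (2021) additionally normalize queries and keys and use variance-reduction tricks, which we omit here):

```python
import numpy as np

def phi(x, W):
    """Random Fourier feature map (Rahimi and Recht, 2008). With the rows
    of W drawn from N(0, I), phi(q) . phi(k) approximates a Gaussian kernel,
    which for normalized q and k recovers exp(q . k) up to a constant."""
    proj = x @ W.T  # (..., D)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(W.shape[0])

def rfa_cross(q_t, keys, values, W):
    """Cross attention: S and z summarize the whole source in O(N),
    then every query position reuses them in O(1)."""
    phi_k = phi(keys, W)              # (N, 2D)
    S = phi_k.T @ values              # (2D, d_v): sum_i phi(k_i) v_i^T
    z = phi_k.sum(axis=0)             # (2D,):     sum_i phi(k_i)
    phi_q = phi(q_t, W)               # (2D,)
    return (phi_q @ S) / (phi_q @ z)  # approximate attention output
```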
In cross attention, S and z represent the source sequence and are constant for all query positions t. In causal attention, they are instead updated recurrently over target positions:

$$S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \quad z_t = z_{t-1} + \phi(k_t).$$

These recurrent computations are analogous to an RNN with S_t and z_t as hidden states at step t and enable constant computation per timestep. RFA serves as a drop-in replacement for attention in transformers. The encoder and other modules, e.g., feed-forward layers, remain the same. We refer the reader to Peng et al. (2021) for more details on RFA.
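A sketch of the corresponding decoding step, reusing phi and W from the cross-attention sketch above; each step touches only fixed-size state, so the per-token cost is constant:

```python
def rfa_causal_step(q_t, k_t, v_t, S_prev, z_prev, W):
    """One step of causal RFA at decoding time. S_t and z_t play the
    role of RNN hidden states; no past keys or values are stored."""
    phi_k = phi(k_t, W)                  # (2D,)
    S_t = S_prev + np.outer(phi_k, v_t)  # S_t = S_{t-1} + phi(k_t) v_t^T
    z_t = z_prev + phi_k                 # z_t = z_{t-1} + phi(k_t)
    phi_q = phi(q_t, W)
    out = (phi_q @ S_t) / (phi_q @ z_t)
    return out, S_t, z_t
```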

Sentential Gating
Schlag et al. (2021) noted, under the lens of fast weight programmers (Schmidhuber, 1991, 1992, 1993), that accumulating memory in a purely additive manner, like S and z above, will reach a capacity limitation with sequences longer than the size of ϕ. This is particularly an issue in document-level translation due to the long sequences.
Nevertheless, document translation admits a natural solution to this constraint: inspired by gated RNNs (Cho et al., 2014, i.a.), we augment RFA with a sentence-level gate to enable dynamic control of contextual information from the current and previous sentences, and to allow the model to selectively forget the history, circumventing the capacity constraint. This is illustrated in Figure 2. For the t-th word with representation e_t, we compute a forget gate using the sentence separator token:

$$f_t = \sigma\left(w_f^\top e_{\mathrm{START}(t)} + b_f\right),$$

where σ is the sigmoid function, w_f and b_f are learned parameters, and START(·) gives the first (separator) token of the sentence containing its argument. The context is decayed when a new sentence starts, and the decay coefficient is reused for all tokens in the same sentence. Specifically, each sentence j assigns a weight

$$0 < \prod_{i=\mathrm{START}(j')+1}^{\mathrm{START}(j)} f_i < 1$$

when attending to a previous sentence j'. This introduces an inductive bias that, intuitively, previous sentences are less important in translation, and their representations are therefore decayed.
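Below is a minimal sketch of the gated recurrence, under the reading that the gate multiplies the running summaries at every step, with f_t computed once per sentence from its separator token and reused within the sentence (so attending across sentence boundaries accumulates the product above); this is a sketch under our assumptions, not the exact released code:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_gate(e_sep, w_f, b_f):
    """Forget gate in (0, 1), computed from the separator token that opens
    the current sentence and reused for every token of that sentence."""
    return sigmoid(w_f @ e_sep + b_f)

def gated_rfa_causal_step(q_t, k_t, v_t, S_prev, z_prev, W, f_t):
    """Causal RFA step with the sentential forget gate: the history is
    decayed by f_t, while the current token is added with weight 1."""
    phi_k = phi(k_t, W)
    S_t = f_t * S_prev + np.outer(phi_k, v_t)  # decay history, add current token
    z_t = f_t * z_prev + phi_k
    phi_q = phi(q_t, W)
    return (phi_q @ S_t) / (phi_q @ z_t), S_t, z_t
```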
Relation to Prior Work. While gating is common in RNNs, it is less clear how it applies to transformers. Miculicich et al. (2018) gated at the sentence level, though hierarchically, while we gate recurrently. Ours also contrasts with the per-token gating of Peng et al. (2021), which they found ineffective for machine translation. These two works also take a weighted average of the previous and current sentences, while we only decay the former. We show our variant performs better in §5. Schlag et al. (2021) used a gate that explicitly models memory removal, but also at the token level.

Experimental Setup
Datasets and Evaluation. We experiment with the IWSLT 2015 Chinese-to-English (zh-en) dataset (Cettolo et al., 2015) with multilingual TED talk captions and the OpenSubtitles 2018 English-to-Russian (en-ru) dataset (Lison et al., 2018) with movie and TV subtitles. We measure document-level BLEU (Papineni et al., 2002) with SacreBLEU (Post, 2018). To quantify discourse consistency, we also use the test sets by Voita et al. (2019) based on OpenSubtitles. We introduce these datasets and their statistics in more detail in §A.1.

Data Processing. We process each document with a stride-one sliding window of L sentences to obtain our training set. Following Voita et al. (2019) and Ma et al. (2021b), we experiment with L = 1, the sentence-level baseline, and L = 4. During inference, we use the last translated sentence in each window for evaluation. For a more granular analysis, we consider L ∈ {1, 2, 3, 4} for consistency experiments. More details are in §A.1.
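As a rough sketch of this preprocessing (assuming aligned per-document sentence lists and a [SEP]-joined window format; the exact tokenization and handling of document boundaries follow §A.1):

```python
SEP = " [SEP] "

def make_windows(src_sents, tgt_sents, L):
    """Stride-one sliding windows of up to L sentences over one document.
    src_sents and tgt_sents are assumed to be aligned lists of strings."""
    examples = []
    for end in range(len(src_sents)):
        start = max(0, end - L + 1)  # early windows are shorter than L
        examples.append((SEP.join(src_sents[start:end + 1]),
                         SEP.join(tgt_sents[start:end + 1])))
    return examples

def last_sentence(translated_window):
    """At inference, only the final sentence of each window is evaluated."""
    return translated_window.split(SEP.strip())[-1].strip()
```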

Model Settings. We compare RFA and the transformer with the concatenation model. For RFA, we experiment with no gating (RFA) and sentential gating (RFA-sgate). To compare our decaying gate with prior work (§3.2), we also run a sentential-gated RFA that takes a weighted average of the previous and current text (RFA-sgate-avg). We mostly default to fairseq hyperparameters (Ott et al., 2019), which are most suitable for the L = 1 transformer (§A.2).

Results

BLEU Scores. Table 1 shows BLEU scores on IWSLT and OpenSubtitles. The sentence-level transformer has the highest IWSLT BLEU, possibly because we default to fairseq hyperparameters optimized for this setting. With L = 4, RFA performs slightly better than the transformer, showing its suitability for long-text translation. Gated RFA brings a further 0.5 BLEU improvement, confirming its utility, but gating has no effect on OpenSubtitles. We hypothesize that with only ≈10 tokens per sentence, half of the average length of IWSLT (Table 2, appendix), the capacity constraint (Schlag et al., 2021) is less severe and thus gating is less useful. Our gate also outperforms the averaging variants of Miculicich et al. (2018) and Peng et al. (2021), validating its effect on document translation. Aligning with prior findings (Voita et al., 2019; Ma et al., 2021b), longer contexts do not clearly lead to better BLEU, though they improve consistency metrics, to which we turn next.
Discourse Consistency Scores. Figure 3 plots the consistency scores in four phenomena for RFA, including our gated variants, and the transformer baselines from Voita et al. (2019) and Ma et al. (2021b). We also re-implement this transformer model to control for confounding factors in implementation details and to extrapolate to L < 4, which they did not thoroughly explore. We compare to a baseline that chooses its prediction randomly from candidate translations; see §A.1 for details.
Translating with longer contexts almost always yields better consistency, which is also the setting where RFA achieves a better speedup, as shown later. Gating does not have a clear benefit, aligning with the BLEU pattern on OpenSubtitles. RFA slightly underperforms the transformer on ellipsis. We hypothesize that the direct query-key interaction in softmax attention is more suitable than the RFA approximation for the precise long-distance information extraction sometimes required for consistency. On lexical cohesion, the transformer shows large variance: with the same architecture and size, Ma et al. (2021b), Voita et al. (2019), and our implementation of the L = 4 transformer achieve drastically different scores. Voita et al. (2019)'s implementation and RFA fail to outperform the random baseline on this phenomenon. Reliable evaluation of lexical cohesion, and of the related task of word sense disambiguation, is known to be challenging in document translation: models tend to rely on dataset artifacts rather than the context, and the attention of well-performing models aligns poorly with the ground truth required for disambiguation (Kim et al., 2019; Emelin et al., 2020; Yin et al., 2021).
Speed. We confirmed the observation from prior work that longer context boosts translation consistency and sometimes BLEU. It would be exciting to examine this trend with L > 4, but to our knowledge, little evaluation data exists for such settings. We therefore measure decoding efficiency with a synthetic experiment, decoding at every L with the same trained model. We focus only on efficiency here, not quality. We measure the number of decoded tokens per second over the forward-pass time on IWSLT's test set. Following Ott et al. (2018), we cache k and v for our baseline, which substantially increases its speed. More details are in §A.3.
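For reference, a sketch of how such a tokens-per-second measurement can be set up (model.decode here is a hypothetical decoding routine returning one token-id list per sentence; only the forward-pass time is accumulated):

```python
import time

def decoding_speed(model, batches):
    """Decoded tokens per second, timing only the forward passes."""
    n_tokens, elapsed = 0, 0.0
    for batch in batches:
        start = time.perf_counter()
        hyps = model.decode(batch)  # incremental decoding forward passes
        elapsed += time.perf_counter() - start
        n_tokens += sum(len(h) for h in hyps)
    return n_tokens / elapsed
```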
Figure 4 shows RFA's speedup relative to the transformer. On GPU, without document context, RFA is slower, likely due to the overhead of its random matrices. Nevertheless, its speedup over the transformer increases roughly linearly with context length, agreeing with the theory, up to 2.09× at our longest tested context, L = 15.
RFA enables an even more substantial speedup on other device types. For sentence-level translation, RFA is in fact faster than the transformer by 1.58× on CPU and, as Peng et al. (2021) reported, by 1.8-1.9× on TPU. At L = 15, its CPU speedup increases to 19.2×. Therefore, depending on the use case, such as decoding on edge devices, RFA could be even more favorable. Furthermore, we used the same batch size for RFA and the transformer. With lower memory complexity, RFA accommodates a larger batch size and in practice achieves a more significant speedup. For example, at L = 15 on GPU, we found that RFA allows a 5× batch size, enabling a more than 7× speedup.
RFA's superior speed makes it an attractive choice for leveraging very long contexts. Nevertheless, we are merely extrapolating the utility of long context from our experiments. The extent to which it really helps needs to be verified with future curated test sets. We hope the demonstration of our model's ability to process document context efficiently and effectively can catalyze such efforts.

Conclusion
We explored the effectiveness of random feature attention on document translation. Our model substantially improves decoding speed over a transformer while achieving similar or improved BLEU. Our sentential gate also proves effective, especially on long sequences. While our model may potentially be used to produce toxic or fake information, it also enables more efficient detectors of such content.

Limitations
Limited by existing document translation datasets, where "documents" are usually relatively short multi-sentence windows, we adopted a semi-synthetic setup for our speed benchmark experiments to examine RFA's effectiveness on long sequences. We believe our results should transfer to real data, since decoding speed is mostly a function of sequence length, but this is not a guarantee. Additionally, while RFA would enjoy a better speedup on TPUs, as reported in the original RFA paper, we did not have the necessary resources to run experiments on TPUs, so our setup does not fully leverage RFA's potential.

Figure 1: The concatenation model for document translation with a sliding window of length L = 4. Every window is translated in its entirety, but only the last translated sentence is used for evaluation. The purple bars denote the sentence separator token.

Figure 3: Model performance on the consistency test set, broken down into phenomena. Transformer and RFA are tested with window sizes from 1 to 4. We compare with the baselines in Voita et al. (2019) and Ma et al. (2021b) corresponding to our transformer at L = 4.

Figure 4: RFA's inference speedup over the transformer in the number of decoded tokens per second. Each sentence has ≈20 tokens (Table 2, appendix).

Table 1: BLEU on IWSLT and OpenSubtitles test sets. Bold scores outperform the transformer.