On Sparsifying Encoder Outputs in Sequence-to-Sequence Models

Sequence-to-sequence models usually transfer all encoder outputs to the decoder for generation. In this work, by contrast, we hypothesize that these encoder outputs can be compressed to shorten the sequence delivered to the decoder. We take Transformer as the testbed and introduce a layer of stochastic gates in-between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0 penalty, resulting in a subset of encoder outputs being completely masked out. In other words, via joint training, the L0DROP layer forces Transformer to route information through a subset of its encoder states. We investigate the effects of this sparsification on two machine translation and two summarization tasks. Experiments show that, depending on the task, around 40-70% of source encodings can be pruned without significantly compromising quality. The decrease in the length of the encoder output gives L0DROP the potential to improve decoding efficiency, where it yields a speedup of up to 1.65x on document summarization tasks against the standard Transformer. We analyze the behaviour of L0DROP and observe that it exhibits systematic preferences for pruning certain word types, e.g., function words and punctuation get pruned most. Inspired by these observations, we explore the feasibility of specifying rule-based patterns that mask out encoder outputs based on information such as part-of-speech tags, word frequency and word position.


Introduction
Neural sequence-to-sequence (Seq2Seq) models have dominated various text generation tasks, including machine translation (Vaswani et al., 2017) and abstractive document summarization (Gehrmann et al., 2018; Liu and Lapata, 2019). 1

1 Source code is available at https://github.com/bzhangGo/zero.

[Figure 1: Distribution of the summed attention weight per source word, estimated on the English-German WMT14 test set. For each (source sentence, translation) pair, we extract the attention matrices from all encoder-decoder attention sublayers in Transformer and average them over the different (8) heads and (6) layers. The attention value for each source word is summed over all target words in the translation. Higher attention weights suggest larger impacts on translation. Around 49.7% of source words get attention weights of less than 0.6, compared to the mean value of 1.03.]
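As a hedged sketch (not the authors' evaluation code), the per-word statistic described in the Figure 1 caption can be computed from extracted attention matrices roughly as follows; the nested-list layout and the helper name `summed_attention` are our own assumptions.

```python
def summed_attention(att_matrices):
    """att_matrices: list over (layer, head) pairs of M x N attention
    matrices, where rows are target words and columns are source words.
    Returns, per source word, the attention weight averaged over all
    layers/heads and summed over the M target words."""
    k = len(att_matrices)          # number of (layer, head) matrices
    n = len(att_matrices[0][0])    # source length N
    totals = [0.0] * n
    for a in att_matrices:
        for row in a:              # one target word's distribution over sources
            for i, w in enumerate(row):
                totals[i] += w / k # average over layers/heads
    return totals
```

With perfectly uniform attention (every target word spreading 1/N over N source words), each source word receives M/N; the mean over source words is always M/N, consistent with a mean near 1 when source and target lengths are similar.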
These models generally follow the encoder-decoder paradigm, where the encoder interprets the source context and converts source words into vector representations such that the decoder has sufficient information to predict the target sequence. Early Seq2Seq models (Sutskever et al., 2014; Cho et al., 2014) provided only the last and/or first encoder states to the decoder. In contrast, modern approaches rely on the attention mechanism (Bahdanau et al., 2015) and implicitly assume that information from all encoder outputs should flow to the decoder. However, this assumption neglects the fact that a large portion of source words in machine translation receives only minor attention, as shown in Figure 1, let alone in summarization, where the input contains redundant expressions and large parts of the text are not relevant to any plausible summary. Moreover, information content varies across words; for example, it is negatively correlated with event frequency (Shannon, 1948; Zipf, 1949). 2

[Figure 2: Attention weights of target words (y-axis) over source words (x-axis) for the vanilla attention (Vaswani et al., 2017), the sparse attention (Correia et al., 2019) and our model. Darker color indicates larger attention weight, and the white blocks denote an attention weight of 0. The source words whose encoding is pruned by L0DROP (receiving zero weight) are highlighted in red.]

In this work, we hypothesize that encoder outputs are compressible and that we can force a Seq2Seq model to route information through a subset of them. Figure 2 illustrates our intuition as well as the difference from existing work (Vaswani et al., 2017; Correia et al., 2019). Instead of dynamically sparsifying attention weights for individual decoder steps (Correia et al., 2019), we aim at detecting uninformative source encodings and dropping them to shorten the encoding sequence before generation. To this end, we build on recent work on sparsifying weights (Louizos et al., 2018) and activations (Bastings et al., 2019) of neural networks. Specifically, we insert a differentiable neural sparsity layer (L0DROP) in-between the encoder and the decoder. The layer can be regarded as providing a multiplicative scalar gate for every encoder output. The gate is a random variable and, unlike standard attention weights, can be exactly zero, effectively masking out the corresponding source encoding. Sparsity is promoted by introducing an extra term to the learning objective, i.e. the expected value of the sparsity-inducing L0 penalty. By varying the coefficient of the regularizer, we can obtain different levels of sparsity. Importantly, the objective remains fully end-to-end differentiable.
Given an encoding sequence of length N, the vanilla attention model attends to it repeatedly for M steps during decoding, leading to a computational complexity of O(NM) (N = 6, M = 6 in Figure 2). This can be costly if N or M is very large. With the sparse structure induced by L0DROP, we introduce a specialized decoding algorithm which lowers this complexity to O(N′M) (N′ ≤ N, and N′ = 3 in Figure 2). As a result, L0DROP offers a chance to improve decoding efficiency by reducing the length of the encoding sequence, especially for long inputs.
We apply L0DROP to Transformer (Vaswani et al., 2017), the state-of-the-art Seq2Seq model. We conduct extensive experiments on WMT translation tasks with two language pairs and on document summarization tasks covering single- and multi-document settings. We analyze how pruning source encodings impacts generation quality and which word types get pruned. We also explore rule-based sparsity patterns inspired by the analysis of L0DROP, such as deterministically filtering out the encodings of words with specific POS tags, of high-frequency words, or simply attending to every other word in the sequence.
Our main findings are summarized as follows:
• We confirm that the encoder outputs can be compressed: around 40-70% of them can be dropped without large effects on generation quality.
• The resulting sparsity level differs across word types: the encodings corresponding to function words (such as determiners and prepositions) are more frequently pruned than those of content words (e.g., verbs and nouns).
• L0DROP can improve decoding efficiency, particularly for lengthy source inputs. We achieve a decoding speedup of up to 1.65× on document summarization tasks.
• Filtering out source encodings with rule-based sparse patterns is feasible, and confirms information-theoretic expectations, although rule-based patterns do not generalize well across tasks.

Related Work
Approaches to compression in Seq2Seq models fall into the categories of model parameter compression (See et al., 2016), sequence-level knowledge distillation (Kim and Rush, 2016), and sparse attention induction, ranging from modeling hard attention (Wu et al., 2018) to developing differentiable sparse softmax functions or regularizing attention weights for sparsity (Niculae and Blondel, 2017; Correia et al., 2019; Cui et al., 2019). Unfortunately, the success of all these studies builds upon access to all source encodings during training and decoding. Learning which encoder outputs to prune in Seq2Seq models has, to the best of our knowledge, never been investigated before. Sukhbaatar et al. (2019) learn attention spans in self-attention and discard information from states outside of the span; this method is not directly applicable to encoder-decoder attention.
We use the differentiable L0 relaxation first introduced by Louizos et al. (2018) in the context of pruning individual neural network parameters; it was also used to prune heads in multi-head attention (Voita et al., 2019). Our work is closer in spirit to Bastings et al. (2019), who used the L0 relaxation to construct interpretable classifiers, i.e. models that can reveal which words they rely on when predicting a class. In their approach, the information from dropped words is lost rather than rerouted into the states of retained words, which is desirable for interpretability but problematic in the text generation setup.
The number of source encodings selected by L0DROP is sentence-dependent, which differs from the linear-time model of Wang et al. (2019), although both can accelerate decoding. Our study of rule-based sparsity patterns is in line with the sparse Transformer (Child et al., 2019), though we also explore the use of external linguistic information (POS tags) in our sparsification rules, and focus on encoder outputs instead of self-attention.

Background: Transformer
We take Transformer (Vaswani et al., 2017) as our testbed. Transformer uses the dot-product attention network as its backbone to handle intra- and inter-sequence dependencies:

ATT(H, M) = AV,  A = SM(QK^T / √d),  (1)

where Q, K, V = HW_q, MW_k, MW_v. The input H ∈ R^{J×d} of length J queries and summarizes task-relevant clues from the memory M ∈ R^{I×d} of length I based on their dot-product semantic matching A ∈ R^{J×I}. SM denotes the softmax function, d is the model dimension, and W_q, W_k, W_v ∈ R^{d×d} are trainable model parameters. Vaswani et al. (2017) also extend this mechanism to multi-head attention. Given a source sequence X = (x_1, x_2, ..., x_N), Transformer maps it to the target sequence Y = (y_1, y_2, ..., y_M) following the encoder-decoder paradigm (Bahdanau et al., 2015): 3

X^l = FFN(ATT(X^{l-1}, X^{l-1})),  (2)
Y^l = FFN(ATT(ATT(Y^{l-1}, Y^{l-1}), X^L)),  (3)

where X^0 ∈ R^{N×d} and Y^0 ∈ R^{M×d} stand for the source and the shifted target sequence embedding, respectively, enriched with positional encoding (Vaswani et al., 2017). FFN(·) is a point-wise feed-forward network. The inner ATT(·, ·) in the decoder denotes masked attention, which prevents access to future target words. Both the encoder and the decoder consist of a stack of L = 6 identical layers, with the encoder output X^L fed to the decoder via an encoder-decoder attention sublayer, i.e. the outer ATT(·, ·) in Eq. (3). Based on the decoder output Y^L, Transformer performs next-word prediction and adopts the maximum likelihood loss for training.

3 Each sublayer (ATT/FFN in the encoder, ATT/ATT/FFN in the decoder) is wrapped with a residual connection (He et al., 2015) followed by layer normalization (Ba et al., 2016), which are omitted from Eq. (2) and (3).
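The attention operation in Eq. (1) can be sketched in plain Python for a single head; for brevity we omit the learned projections W_q, W_k, W_v (so Q = H and K = V = M), which is our simplification, not the full model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def att(H, M, d):
    """ATT(H, M) = SM(Q K^T / sqrt(d)) V with projections omitted.
    H: J x d list of query vectors, M: I x d list of memory vectors."""
    out = []
    for q in H:
        # one row of A: scaled dot-product scores against every memory slot
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in M]
        w = softmax(scores)  # attention weights, sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, M)) for j in range(d)])
    return out
```

Each output row is a convex combination of the memory vectors, weighted by the softmax-normalized dot products.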

Neural Sparsity Layer: L0DROP
In this section, we introduce a neural sparsity layer (L 0 DROP), which we use to prune encoder outputs. At inference time, only retained encoder outputs will be used as input to the decoder.

Training with L0DROP

L0DROP attaches a multiplicative scalar gate g_i ∈ [0, 1] to each encoder output x_i,

x̂_i = g_i · x_i,  (4)

and prunes encodings by closing their gates, i.e. g_i = 0, relying on a differentiable sparsity-inducing penalty added to the objective. More formally, to achieve sparsity, each gate is treated as a random variable whose value is drawn from the HardConcrete distribution:

g_i ∼ HardConcrete(α_i, β, ε),  (5)

where α_i, β and ε are shape parameters of the distribution. HardConcrete (Louizos et al., 2018) is a parameterized family of mixed discrete-continuous distributions over the closed interval [0, 1]. These distributions have point mass at 0 and 1 and continuous density in-between, i.e. in (0, 1), as shown in Figure 3. Thus, the gates have a non-zero probability of being exactly 0, corresponding to masking out the input completely. Specifically, a sample from the HardConcrete distribution is obtained by stretching and rectifying a sample from the BinaryConcrete distribution (Maddison et al., 2017; Jang et al., 2017):

s_i = σ((log u − log(1 − u) + log α_i) / β),  u ∼ U(0, 1),  (6)
s̄_i = s_i · (1 + 2ε) − ε,  (7)
g_i = min(1, max(0, s̄_i)).  (8)

In the above expressions, we first obtain a sample s_i from the BinaryConcrete distribution (Eq. (6)), then stretch it from (0, 1) to (−ε, 1 + ε) (Eq. (7), ε > 0), and finally rectify it with a hard sigmoid to the closed interval [0, 1] (Eq. (8)).
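The stretch-and-rectify sampling procedure of Eqs. (6)-(8) is straightforward to sketch; this is a minimal illustration under the paper's hyperparameters (β = 2/3, a stretch of ε = 0.1), with the function name being our own.

```python
import math, random

def sample_hardconcrete(log_alpha, beta=2/3, eps=0.1, rng=random):
    """Draw one gate g ~ HardConcrete(alpha, beta, eps):
    sample a BinaryConcrete variable (Eq. 6), stretch it to
    (-eps, 1+eps) (Eq. 7), and rectify with a hard sigmoid (Eq. 8)."""
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # u ~ U(0, 1), clamped for log
    # Eq. (6): BinaryConcrete sample in (0, 1)
    s = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    # Eq. (7): stretch from (0, 1) to (-eps, 1 + eps)
    s_bar = s * (1 + 2 * eps) - eps
    # Eq. (8): rectify to the closed interval [0, 1]
    return min(1.0, max(0.0, s_bar))
```

A very negative log α concentrates the mass at exactly 0 (the gate is closed almost surely), while a large positive log α concentrates it at exactly 1.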
Note that the probability of g_i being exactly 0, p(g_i = 0 | α_i, β, ε), equals the probability of s̄_i falling into (−ε, 0] and is available in closed form (Louizos et al., 2018):

p(g_i = 0 | α_i, β, ε) = σ(β log(ε / (1 + ε)) − log α_i),  (9)

where σ(·) denotes the sigmoid function. The parameter α_i (i.e. the location parameter of the BinaryConcrete distribution) is predicted from the encoder output x_i:

log α_i = x_i^T w,  (10)

where w ∈ R^d is a learned parameter vector; the temperature β and the stretch degree ε are treated as hyperparameters. By adjusting α_i the model can change the shape of the HardConcrete distribution, and thus dynamically decide which outputs to pass to the decoder and which to prune. Note that the sum

Σ_{i=1}^{N} (1 − p(g_i = 0 | α_i, β, ε))  (11)

yields the expected number of open gates or, equivalently, the expected L0 norm of the gate vector (g_1, ..., g_N). Minimizing this quantity encourages the model to prune encoder outputs.
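The closed form in Eq. (9) can be checked against a Monte Carlo estimate from the sampling procedure of Eqs. (6)-(8). This self-contained sketch (function names are ours) re-implements the sampler inline:

```python
import math, random

def p_zero(log_alpha, beta=2/3, eps=0.1):
    """Eq. (9): closed-form probability that the gate is exactly 0."""
    x = beta * math.log(eps / (1 + eps)) - log_alpha
    return 1.0 / (1.0 + math.exp(-x))

def mc_p_zero(log_alpha, beta=2/3, eps=0.1, n=100_000, seed=1):
    """Monte Carlo estimate of p(g = 0): sample HardConcrete gates and
    count how often the stretched sample is rectified to exactly 0."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        s = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
        if s * (1 + 2 * eps) - eps <= 0.0:  # Eq. (8) clamps this to 0
            zeros += 1
    return zeros / n
```

Larger log α lowers the pruning probability, which is how the model keeps informative encodings open.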
Once L0DROP is integrated as a new layer into Transformer, the decoder, previously defined in Eq. (3), becomes:

Y^l = FFN(ATT(ATT(Y^{l-1}, Y^{l-1}), X̂^L)),

where X̂^L = (g_1 · x_1^L, ..., g_N · x_N^L) is the gated encoder output. Other components of Transformer are kept intact, except for a modified training objective L(X, Y):

L(X, Y) = E_{p(g|φ)} [ −log P(Y | X, g) ] + λ Σ_{i=1}^{N} (1 − p(g_i = 0 | φ)),  (12)

where φ is short for (α, β, ε) and λ ∈ R+ is a hyperparameter defining the level of sparsity. This objective upper-bounds the marginal negative log-likelihood −log E_{p(g|φ)}[P(Y | X, g)] plus the regularizer; the bound is derived by applying Jensen's inequality. Importantly, the objective remains fully differentiable, as we can rely on the reparameterization technique (Kingma and Welling, 2013) to sample g̃ for computing unbiased estimates of the gradients. Adding L0DROP and the regularizer introduces only negligible computational overhead to training compared to the original Transformer.
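Putting the pieces together, a minimal training-time L0DROP forward pass might look as follows. This is a pure-Python sketch under our own naming and interface assumptions; a real implementation would live in a framework with automatic differentiation so the sampled gates stay on the gradient path.

```python
import math, random

class L0Drop:
    """Gates each encoder output x_i with a sampled HardConcrete gate and
    accumulates the expected L0 penalty of Eq. (11) for the loss in Eq. (12)."""

    def __init__(self, w, beta=2/3, eps=0.1, seed=0):
        self.w, self.beta, self.eps = w, beta, eps
        self.rng = random.Random(seed)

    def _log_alpha(self, x):
        # Eq. (10): log alpha_i = x_i . w (our reconstruction)
        return sum(xi * wi for xi, wi in zip(x, self.w))

    def forward(self, encodings):
        gated, expected_l0 = [], 0.0
        for x in encodings:
            la = self._log_alpha(x)
            # Eqs. (6)-(8): reparameterized HardConcrete sample
            u = min(max(self.rng.random(), 1e-12), 1 - 1e-12)
            s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + la) / self.beta))
            g = min(1.0, max(0.0, s * (1 + 2 * self.eps) - self.eps))
            gated.append([g * xi for xi in x])     # Eq. (4): x_hat = g * x
            # Eq. (11): expected number of open gates, 1 - p(g_i = 0)
            p0 = 1 / (1 + math.exp(-(self.beta * math.log(self.eps / (1 + self.eps)) - la)))
            expected_l0 += 1.0 - p0
        return gated, expected_l0
```

The total training loss is then the usual negative log-likelihood plus λ times `expected_l0`, as in Eq. (12).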

Decoding with L0DROP
At test time we do not sample gate values but instead use an estimate of their expected value ĝ_i (Louizos et al., 2018):

ĝ_i = min(1, max(0, σ(log α_i) · (1 + 2ε) − ε)),  (13)

which often turns out to be exactly 0 or 1, albeit in-between in some cases. Encodings with non-zero ĝ_i are preserved and simply weighted by the gate.
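The test-time estimator of Eq. (13), as we reconstruct it from Louizos et al. (2018), is a one-liner; the hard sigmoid pushes most gates to exactly 0 or 1.

```python
import math

def expected_gate(log_alpha, eps=0.1):
    """Eq. (13): deterministic test-time gate estimate. The stretch and
    hard sigmoid saturate the estimate to exactly 0 or 1 unless
    sigmoid(log_alpha) lands in a narrow middle band."""
    s = 1.0 / (1.0 + math.exp(-log_alpha))   # sigmoid(log alpha)
    return min(1.0, max(0.0, s * (1 + 2 * eps) - eps))
```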
To leverage the induced sparse structure, we revise the decoding procedure as in Algorithm 1. The notation [·, ·] refers to row-wise concatenation, [I] stands for extracting the elements with indices I, ⊙ is element-wise multiplication, and 1 ∈ R^N indicates a vector of ones of length N. We first reorganize the gates ĝ ∈ R^N and the source encodings X^L ∈ R^{N×d} by discarding the entries corresponding to closed gates (ĝ_i = 0, lines 1-2). We augment the compressed sequence X̃^L ∈ R^{N′×d} with a dummy zero encoding vector 0 ∈ R^d to represent all pruned encodings, and record their count in a counting vector c ∈ R^{N′+1} (line 4). 4

[Algorithm 1: The encoder-decoder attention with L0DROP at decoding time. Input: source encodings X^L ∈ R^{N×d}; gates ĝ ∈ R^N; query state y_j^l ∈ R^d. Output: attention vector for the query. Step 1 (lines 1-4): reorganize the source-side inputs. Step 2 (lines 5-8): attention with counts.]

We then modify the attention process to include this counting information (lines 5-8) so as to correctly estimate the attention weights. Note that the shortened source sequence X̃^L is reused across decoder layers and steps. L0DROP changes the dependency of the encoder-decoder attention on the source sequence from O(NM) to O(N′M), and allows for efficiency gains even with moderate sparsity, especially for large L, N and M.
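The count-based trick is exact because every pruned position contributes the same zero vector to the attention, so all pruned positions can be collapsed into one slot whose unnormalized weight is multiplied by their count. A self-contained sketch (single query, projections omitted, helper names ours) that checks this equivalence:

```python
import math

def attend(q, mem, d, counts=None):
    """Dot-product attention of one query over mem; `counts` multiplies
    the unnormalized weight of each entry (lines 5-8 of Algorithm 1)."""
    counts = counts if counts is not None else [1] * len(mem)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in mem]
    mx = max(scores)
    w = [c * math.exp(s - mx) for c, s in zip(counts, scores)]
    z = sum(w)
    return [sum(wi / z * v[j] for wi, v in zip(w, mem)) for j in range(d)]

def compress(encodings, gates):
    """Lines 1-4: drop closed gates, weight the survivors by their gates,
    and collapse all pruned positions into one zero vector with a count."""
    kept = [[g * xi for xi in x] for x, g in zip(encodings, gates) if g != 0]
    pruned = sum(1 for g in gates if g == 0)
    counts = [1] * len(kept)
    if pruned:
        kept.append([0.0] * len(encodings[0]))  # dummy slot for all pruned entries
        counts.append(pruned)
    return kept, counts
```

Attending over the compressed memory with counts produces exactly the same output as attending over the full gated sequence, while the per-step cost drops from O(N) to O(N′).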

Experimental Setup
Machine Translation We train translation models on the WMT14 English-German translation task (En-De) (Bojar et al., 2014) and the WMT18 Chinese-English translation task (Zh-En) (Bojar et al., 2018). We use newstest2013 as the validation set for WMT14 En-De and newstest2017 for WMT18 Zh-En. We evaluate translation quality with the BLEU metric (Papineni et al., 2002), reporting tokenized BLEU on newstest2014 for WMT14 En-De and detokenized BLEU on newstest2018 for WMT18 Zh-En using sacreBLEU (Post, 2018). We apply the byte pair encoding (BPE) algorithm (Sennrich et al., 2016) with 32K merge operations to handle rare words for both translation tasks.

Document Summarization
We train abstractive summarization models on the CNN/Daily Mail dataset (Hermann et al., 2015) and the WikiSum dataset (Liu et al., 2018) for the single- and multi-document summarization tasks, respectively. We use the non-anonymized version of CNN/Daily Mail (Gehrmann et al., 2018). We preprocess this dataset with a BPE vocabulary of 32K and truncate each article to 400 subwords (Gehrmann et al., 2018). We use the ranked version of WikiSum (Liu and Lapata, 2019), where the top 40 paragraphs are extracted for each instance and paired with a summary of 121 words on average. We concatenate all these paragraphs into one source sequence following the given ranking order. We employ BPE preprocessing following Liu and Lapata (2019) and truncate each source sequence to 2048 subwords. We evaluate summarization quality using the F1 score of ROUGE-L (Lin, 2004).

Model Settings
We formulate all the above tasks as sequence-to-sequence tasks, and experiment with the base setting of Transformer (Vaswani et al., 2017): d = 512, the inner layer size of FFN(·) is 2048, and the number of attention heads is 8. Following Louizos et al. (2018), we set the stretch interval to (−0.1, 1.1) and β = 2/3 for L0DROP. We tune the hyperparameter λ for different tasks, as discussed in detail in the following sections. Extra details are provided in the Appendix.

Results and Analysis
How much can encoder outputs be sparsified? We answer this question by analyzing the impact of pruning source encodings on generation quality. We first train a baseline Transformer model, and then finetune this model using L0DROP (Eq. (12)) with varied λ to explore different levels of sparsity. We sample λ from the range (0, 1.5] with a step size of 0.1, and finetune the WMT14 En-De and WMT18 Zh-En models for an extra 50K steps, and CNN/Daily Mail for an extra 20K steps. We measure sparsity with the sparsity rate, defined as the ratio of the number of pruned source encodings, #(ĝ_i = 0), to the total number of source words. Figure 4 shows the results. The generation quality exhibits a negative correlation with the sparsity rate across different tasks, reflecting the usefulness of encoder outputs for generation. However, the fact that we can remove about 40% of source encodings without largely degrading generation performance (-0.5 BLEU and -0.1 ROUGE-L) supports our hypothesis that we can force a Seq2Seq model to route information through a subset of its source encodings. We also observe that the compressibility seems relatively language-independent (the curves of WMT14 En-De in Figure 4(a) and WMT18 Zh-En in Figure 4(b) are similar) but clearly task-dependent.

[Figure 5: Sparsity rate per encoding type; the x-axis denotes the overall sparsity rate. The encoding of content words and BPEH is more valuable for generation, compared to that of function words and punctuation.]
Compared to the translation tasks, the summarization task is less sensitive to the pruning of source encodings (-1.89 ROUGE-L versus -3.0 BLEU at a sparsity rate of ∼70%). We ascribe this to the nature of summarization, where the summary reflects only a part of the input document rather than the entire document.
Note that the pretrain-then-finetune scheme is mainly used to save training effort. By scheduling λ linearly with the number of training steps, we can also train models with L0DROP (Eq. (12)) from scratch, obtaining a BLEU score of 27.03 (λ = 0.2, warmup of 200K steps) on WMT14 En-De, comparable to finetuning (27.04).
What types of source encoding are required for generation? Our goal here is to understand the encodings of which types of tokens are retained. For each source encoding, we regard the POS of its corresponding word as its type. We take WMT14 En-De as our benchmark, annotating POS tags for the source sentences in the test set using the Stanford POS tagger (Toutanova et al., 2003). We handle subwords separately by labeling the first piece of a segmented word as BPEH and the others as BPEO, regardless of the POS of its unsegmented form. We group the different POS tags into 6 categories for the sake of analysis: BPEH, BPEO, function words, content words, punctuation and the rest. 5 Figure 5 shows how the sparsity rate of each encoding type changes as a function of the overall sparsity rate. We find that L0DROP first chooses to eliminate the encodings of punctuation, followed by those of function words. These words often signal structural and grammatical relationships that, while important for building up a representation of the sentence, can be easily compressed. In contrast, pruning content words, which express richer lexical meaning, is more difficult. The sharp increase in content word sparsity beyond an overall sparsity rate of 0.5 in Figure 5 correlates with a sharp drop in translation quality (see Figure 4(a)). We also observe a large difference between BPEO and BPEH, albeit both coming from the same word: L0DROP favours pruning the encoding of BPEO, indicating that the model learns to use word-initial representations (BPEH) to represent whole words.
What is the effect of L0DROP on Transformer? Transformer can lose access to around 40% of the source encodings while largely retaining the same performance. We try to figure out what has changed inside Transformer in order to support L0DROP, and analyze the attention weights (i.e. A in Eq. (1)) of all encoder-decoder attention sublayers and of the last encoder self-attention sublayer; these sublayers are directly connected with L0DROP in the computation graph. We experiment on WMT14 En-De. We visualize the distribution of the encoder-decoder attention weight per source word for a Transformer with a sparsity rate of 47% (BLEU 27.06). Compared to the vanilla Transformer (Figure 1), the distributions in Figure 6 show that the average attention weight obtained by each source word has increased (+0.77, 1.03→1.80), and the proportion of source words receiving attention weights of less than 0.6 is substantially reduced, by a factor of 10 (49.7%→4.5%). This indicates that L0DROP forces Transformer to distribute its attention more evenly among the retained source encodings.

[Figure 7: Entropy of the last-layer encoder self-attention for the retained encodings (top) and the pruned ones (bottom) versus the sparsity rate on the WMT14 En-De test set. We use the sparsity variable ĝ learned by L0DROP to classify the encodings of our baseline Transformer. Higher entropy indicates that the distribution tends towards uniform. With fewer retained encodings, Transformer tends to spread its attention weights to include more source-side information.]
Apart from the encoder-decoder attention, we also inspect the self-attention in the last encoder layer. We average the self-attention weights over the 8 different heads, and compare the attention entropy of the retained source encodings (ĝ_i ≠ 0) and the pruned ones (ĝ_i = 0). We report entropy values averaged over the whole test set. Figure 7 shows how increasing sparsity affects the entropy. Although L0DROP selects uninformative encodings to drop, the increase in the entropy of the retained encodings (Figure 7(a)), when compared to the baseline, suggests that the encoder actually encodes more context information into these representations, confirming that the model learns to compress context information when sparsity is enforced. Another observation is that the entropy curve of L0DROP for the pruned encodings is in line with that of the baseline, albeit on a larger scale (Figure 7(b)). This signifies that L0DROP adapts Transformer to better coordinate with source context representations, which ensures its effectiveness for generation.

[Table 1: Decoding results with λ = 0.3. "Time": the decoding time (in seconds) for the whole test set. "Sparsity": the sparsity rate; 0.00% indicates the Transformer baseline. "Speedup": the decoding acceleration over the baseline. "Quality": BLEU for the WMT tasks and ROUGE-L for the summarization tasks. We evaluate the decoding time on a GeForce GTX 1080 Ti, with a batch size of 32 for the WMT tasks and 10 for the summarization tasks.]
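The entropy statistic used here is the standard Shannon entropy of each attention distribution; a minimal sketch (the function name is ours):

```python
import math

def attention_entropy(row):
    """Shannon entropy (in nats) of one attention distribution.
    Higher values mean the weights are spread more uniformly;
    a one-hot distribution has entropy 0."""
    return -sum(p * math.log(p) for p in row if p > 0)
```

Averaging this quantity over all retained (or all pruned) positions in a test set yields the curves compared in Figure 7.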
Can we prune encodings earlier in the encoder? Rather than stacking L0DROP on top of the encoder outputs, we insert L0DROP in-between every adjacent pair of encoder layers. We work on WMT14 En-De and finetune with λ = 0.2. We get sparsity rates of 0.0%, 0.0%, 8.6%, 8.6%, 8.7% and 34.0% for the first to the last L0DROP layer, respectively, with a BLEU score of 26.74. This result suggests that Transformer does not gain much benefit from pruning encodings earlier. The model tends to retain encodings at shallow levels (0.0%/8.6% < 34.0%), and loses 0.3 BLEU compared to its L0DROP baseline (λ = 0.2, sparsity rate 31.7%, BLEU 27.04). We believe that the encoder relies on low-level information (including the words themselves) to fully 'understand' the sentence, even though part of the final encodings is discardable.
Can we make the decoding faster with L0DROP?
With appropriate finetuning, L0DROP can shorten the encoding sequence fed to the decoder, reducing the amount of computation in the encoder-decoder attention. However, the encoder-decoder attention accounts for only about 1/3 of the decoder computation, 6 and Algorithm 1 also brings in extra overhead, such as gathering and indexing operations. Thus, a speedup is not guaranteed, and we report empirical decoding times across different tasks. The results in Table 1 show that L0DROP only marginally improves the decoding speed for machine translation, despite high sparsity rates of 46.7% (WMT14 En-De) and 39.1% (WMT18 Zh-En). By contrast, L0DROP yields speedups of 1.21× and 1.65× on CNN/Daily Mail and WikiSum, respectively. One explanation lies in the significant difference in target sequence length: the average length per summary is >60 tokens, compared to ∼25 in machine translation. Note that L0DROP achieves a substantially higher sparsity rate of 71.5% on WikiSum with the same λ = 0.3. This is because the input paragraphs overlap in content; the information about redundant words does not need to be routed into other encoder states, making it easier to prune them.

Exploring Rule-based Sparse Patterns
Our analysis shows that the sparsity induced by L0DROP follows certain patterns, with the encodings of 'less content-bearing' words pruned first. This suggests that we may be able to define heuristic patterns manually. In this section, we explore the following three rule-based patterns, designed according to our study on WMT14 En-De:

POS Pattern: This pattern discards the source encodings of the easy-to-prune types, including function words, punctuation, BPEO and MD, EX, which account for 46.4% of the source-side WMT14 En-De training data.

Freq Pattern: Inspired by the fact that punctuation and function words are high-frequency words, we filter out the source encodings corresponding to the most frequent words with a threshold of 46.3% (the top 100 words). We also include an inverse version, the Inv Freq Pattern, for comparison, which drops the encodings of the rarest words; source words whose frequency ranks below 452 are removed, covering ∼40.0% of the source training data.

Group Pattern: We explore a position-based pattern that only feeds the encodings at odd positions to the decoder, corresponding to a sparsity rate of ∼50%. This pattern is partially motivated by Child et al. (2019).
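The Freq and Group patterns are context-agnostic and can be sketched directly as boolean keep-masks over the source positions. The top-100 threshold for the Freq pattern comes from the text; the helper names and the rank-dictionary interface are our own assumptions.

```python
def freq_mask(tokens, rank, top_k=100):
    """Freq Pattern: keep (True) a token's encoding unless its frequency
    rank is within the top_k most frequent words (rank 1 = most frequent).
    Unknown tokens are treated as rare and kept."""
    return [rank.get(t, float('inf')) > top_k for t in tokens]

def group_mask(n):
    """Group Pattern: keep only the encodings at odd positions (1-based),
    i.e. every other encoding, for a sparsity rate of ~50%."""
    return [(i % 2 == 0) for i in range(n)]  # 0-based index 0 -> position 1
```

The decoder then attends only to positions whose mask entry is True, exactly as it attends to the open-gate positions under L0DROP.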
Note that the design of these patterns follows our analysis of L0DROP, where we match the sparsity rate of each pattern to the optimal rate of L0DROP on WMT14 En-De. We examine the feasibility of these patterns on WMT14 En-De and CNN/Daily Mail. Table 2 shows the results. On WMT14 En-De, Transformer using these rule-based patterns achieves translation quality comparable to L0DROP (-0.24 to +0.05 BLEU) at similar sparsity rates. One interesting observation is that Transformer also works with language- and context-agnostic sparsity patterns (Freq Pattern). The performance drop of the Inv Freq Pattern (-0.64 BLEU) is in line with the information-theoretic expectation that information from frequent words is easier to compress than that from rare words.
However, note that we developed our heuristics to mimic the behaviour of L0DROP on the WMT14 En-De task. L0DROP has the advantage of being data-driven and task-agnostic, so we can easily apply it to summarization. By contrast, the rule-based patterns discovered on translation tasks are not optimal for other tasks, which results in deteriorated performance on CNN/Daily Mail (-5.82 to -0.84 ROUGE-L). In particular, Transformer suffers the largest performance drop with the Group Pattern (-5.82 ROUGE-L). These results suggest that using rule-based sparse patterns to manually define the sparsity of encoder outputs is possible, though the patterns lack the ability to generalize to different tasks.
Conclusion

By introducing an L0-regularized neural sparsity layer (L0DROP) into Transformer, we confirm that the encoder outputs are compressible to varying degrees. Pruning encoder outputs often results in a drop in performance, but we can obtain comparable results with 40-70% of source encodings dropped. One benefit of pruning source encodings is that it shortens the encoding sequence fed to the decoder, which accelerates decoding by up to 1.65× on document summarization tasks. Our analysis on WMT14 En-De shows that L0DROP learns to drop the encodings of (relatively frequent) function words and to retain the encodings of (relatively rare) content words, but relies on self-attention to reroute information from the to-be-pruned positions. Based on our analysis, we define rule-based sparsity patterns, which also allow for compression without degrading translation quality much, and show that frequent tokens are more amenable to sparsification than rare tokens. However, we find that our rule-based patterns do not generalize across tasks, while L0DROP is data-driven and applicable across tasks. We hope that, besides its practical implications, our work contributes to a better understanding of encoder-decoder models.