Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation

The multi-head self-attention mechanism of the transformer model has been thoroughly investigated recently. In one vein of study, researchers seek to understand why and how transformers work. In another, researchers propose new attention augmentation methods to make transformers more accurate, efficient, and interpretable. In this paper, we combine these two lines of research in a human-in-the-loop pipeline to first discover important task-specific attention patterns. These patterns are then injected, not only into smaller models, but also into the original model. The benefits of our pipeline and discovered patterns are demonstrated in two case studies on extractive summarization and topic segmentation. After discovering interpretable patterns in BERT-based models fine-tuned for the two downstream tasks, our experiments indicate that injecting the patterns into attention heads yields considerable improvements in accuracy and efficiency.


Introduction
With transformer-based models (Vaswani et al., 2017) dominating the leaderboard for many key NLP tasks such as summarization (Liu and Lapata, 2019), topic segmentation (Lukasik et al., 2020), and sentiment analysis (Adhikari et al., 2019), their core multi-head self-attention mechanism has also been thoroughly investigated. In particular, to explain why and how transformers work, researchers have analyzed the learnt self-attention matrices of trained transformer models (e.g., Raganato and Tiedemann (2018); Kovaleva et al. (2019)), with Vig and Belinkov (2019), for instance, exploring the attention patterns in BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), as well as analyzing their alignment with syntax.
Meanwhile, a parallel line of research has explored injecting predefined patterns into attention matrices of transformers in an attempt to reduce the run-time complexity of self-attention while maintaining competitive accuracy. This can be done either by replacing the attention weights with a fixed matrix (Raganato et al., 2020; Tay et al., 2021; Xiao et al., 2020), or alternatively by guiding the attention weights through more flexible masking strategies (Mihaylov and Frank, 2019; Child et al., 2019; Guo et al., 2019; Li et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Bai et al., 2021).
In this work, we propose and test a novel human-in-the-loop pipeline that, to the best of our knowledge, is the first attempt to combine research on analyzing self-attention with work on injecting patterns into attention matrices. To start, human users visually explore the attention matrices of transformers to identify task-specific patterns that can be formalized as a predicate. After quantitatively evaluating the patterns on the validation set, they can be injected into attention heads of transformer models to simultaneously improve task accuracy and make the model more efficient by sparsifying the attention matrices. This is in contrast to previous work that mostly focuses on the trade-off between these two metrics.
In both scenarios, we argue the interpretability of the resulting model is improved. We provide a justification of this claim based on the Predictive, Descriptive, and Relevant (PDR) framework proposed by Murdoch et al. (2019). Specifically, by injecting human-interpretable patterns into the model, we increase the model's descriptive accuracy by explicitly encoding useful relationships between input tokens in the attention weights, while simultaneously improving the predictive accuracy in task performance. Further, the patterns are relevant for the problem since they are discovered in the human-in-the-loop process and are verified to be important for the task.
In order to test the feasibility and potential benefits of our approach, we run two case studies on the tasks of extractive summarization and topic segmentation using BERT-based models, and we find that: (i) Some of the important heads do have patterns with interpretable meaning, either lexical, local, or positional. For instance, the matching-token pattern (i.e., the tendency to attend to other tokens with the same id) is an important clue for the summarization model. (ii) We show that when the discovered patterns are injected into the attention heads of transformer models, both the task accuracy and efficiency of the model can be significantly improved.
(iii) Additionally, we also propose a strategy to improve the performance of pretrained transformer models by injecting patterns through Projected Attention Layers (PALs).

Related Work
In §2.1 and §2.2, we describe the two lines of research that our work aims to combine. §2.3 summarizes recent trends on enhancing the interpretability of neural NLP models, while §2.4 introduces the two NLP tasks used for our case studies.

Attention Analysis in Transformers
Various works have investigated the attention head matrices in transformers (Raganato and Tiedemann, 2018; Clark et al., 2019; Kovaleva et al., 2019; Zhao and Bethard, 2020; Xiao et al., 2021), often with the aid of visualization tools (Vig, 2019; Hoover et al., 2020; Li et al., 2021). For example, Vig and Belinkov (2019) visually explore attention patterns in BERT and GPT-2, analyzing their alignment with syntax, while Voita et al. (2019) characterize the functions of the attention heads in Machine Translation (MT) models (positional, syntactic, and rare words), and evaluate the importance of those head functions. More recently, Bian et al. (2021) find the redundancy in BERT's attention patterns to be both phase-independent (pretrained and fine-tuned) and task-agnostic. Lastly, Huber and Carenini (2022) infer discourse structures from the attention patterns of language models (BERT and BART), and find discourse information to be consistently captured in the same heads even when fine-tuned for different tasks. In this paper, we also aim to find task-specific important attention patterns, but in contrast to previous work that identifies and categorizes attention patterns, we propose a pipeline that leverages these patterns to improve models' performance and interpretability.

Attention Augmentation
We organize the related work on augmenting attention matrices into two categories. In the first category, attention weights are completely replaced with a fixed matrix. For example, Raganato et al. (2020) use fixed positional patterns in MT models and demonstrate benefits for low-resource scenarios, while Tay et al. (2021) replace the weights computed using dot-product self-attention with a random matrix, and report comparable performance with standard transformers. Later on, Xiao et al. (2020) expand this line of work by using embedded RST-style discourse trees as fixed attention matrices and show the effectiveness of discourse-based attention matrices for extractive summarization. In contrast, in the second category of attention augmentation works, masks are applied on top of the attention weights to either inject linguistic information (Yang et al., 2018; Mihaylov and Frank, 2019) or improve the efficiency of self-attention via fixed patterns (Child et al., 2019; Guo et al., 2019; Li et al., 2019; Ainslie et al., 2020). To describe just a few prominent examples, Strubell et al. (2018) use bi-affine attention to learn syntactic dependencies in attention heads, and Bai et al. (2021) inject syntactic structures into BERT through extra attention layers. Concurrently, while Beltagy et al. (2020) use diagonal/vertical/horizontal patterns to model local and global context respectively, Zaheer et al. (2020) add patterns randomly by drawing inspiration from graph theory. In comparison, while in all previous works the design of pre-defined patterns requires extensive trial and error, and the patterns improve either accuracy or efficiency only at the expense of the other, we explore a strategy of discovering and assessing important attention patterns interactively. Not only do the discovered patterns help improve performance in terms of both accuracy and efficiency, they also reveal valuable insights regarding the internal workings of pretrained language models.

Model Interpretability
In the context of Machine Learning, interpretability can be defined as the description of the internals of a model in a way that is understandable to humans (Gilpin et al., 2018). With the rise of deep learning, various techniques have been proposed to interpret the inner workings of neural NLP models. For example, probing classifiers are often used for finding linguistic or knowledge information learned by neural networks (Conneau et al., 2018; Tenney et al., 2019; Pimentel et al., 2020; Voita and Titov, 2020; Hou and Sachan, 2021; Aghazadeh et al., 2022), while behaviour testing aims at understanding how models behave through inferences under different controlled settings (McCoy et al., 2019; Ross and Pavlick, 2019; Ribeiro et al., 2020; Koh et al., 2021; Goel et al., 2021). In contrast, our work is an example of making interpretability an inherent attribute of the neural model (e.g., Chen and Ji (2020); Hu et al. (2021)), with human-distinguishable patterns revealing insights regarding a subset of parameters in the model.

NLP Tasks used in the two Case Studies
Extractive summarization is the task of picking the most representative sentences as the summary for the given document(s). Current state-of-the-art models, which are mostly based on large-scale pretrained language models (Liu and Lapata, 2019; Zhong et al., 2020; Jia et al., 2020; Ruan et al., 2022), can deliver good performance, but why and how such models work so well remains an open question. In our case study, we adopt the popular BERTSum (Liu and Lapata, 2019).
Topic segmentation is the task of breaking stretches of running text into smaller topically coherent segments, each consisting of one or more sentences addressing a common topic. Recently, more research frames the task in the supervised learning paradigm and uses neural models such as Bi-LSTMs (Koshorek et al., 2018; Xing et al., 2020) and transformers (Glavas and Somasundaran, 2020; Lo et al., 2021) as the backbone, due to the availability of large-scale labeled benchmarks sampled from Wikipedia. These neural topic segmentation models achieve state-of-the-art performance on monologue text by formulating the problem as a sequence labeling task, where the predicted label of each sentence indicates whether or not it is the end of a segment. In our case study, we adopt Cross-Segment BERT (Lukasik et al., 2020).

Proposed Generic Pipeline
As an overview, we first briefly describe the proposed pipeline (Figure 1). Specifically, given a trained model, users are asked to first discover important patterns using the visual interface (Li et al., 2021) by following three steps:

Step 1 (§3.1.1): Estimate the importance scores for all the heads on the validation set, and find important heads that stand out.

Step 2 (§3.1.2): Discover relevant patterns in the important heads, using the criteria described in §3.1.2.

Step 3 (§3.1.3): Evaluate and validate the patterns to confirm their global relevance.
Once the important patterns are identified, there are two common approaches (i.e., fixing and masking) to inject them as constraints into the attention matrices of transformer-based neural models (see §3.2). The pipeline also enables two scenarios in which injecting the patterns can be beneficial: the first is to train a new model with the patterns injected, while the second is to enhance the original model.

Discover Patterns from Attention
In this section we provide details of the three steps for discovering patterns from the attention heads. The three steps are illustrated in Figure 1 (B).

Step 1: Estimate Head Importance
Although the multi-head self-attention mechanism in transformers allows the model to learn multiple types of relationships between input representations across a single hidden layer, the importance of the individual attention heads can vary depending on the downstream tasks. In practice, we propose the use of scalable gradient-based methods (Michel et al., 2019; Voita et al., 2019; Molchanov et al., 2019) for an efficient estimation of head importance, and take the top-K heads at each layer to find important patterns for the task (§3.1.2). Note that K can be adjusted based on the availability of human users and the size of the model.

Step 2: Find Attention Patterns
Once the most important heads are identified, their attention distributions are inspected to look for patterns.
We define an attention pattern to be interpretable iff it can be modeled as a predicate P between any pair of input tokens (x_i, x_j). For instance, the positional pattern 'preceding token' holds for (x_i, x_j) when x_j is the token immediately preceding x_i. Candidate patterns can be discovered following two criteria: 1) they are beneficial for the downstream task; 2) they occur consistently among relevant tokens.
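For illustration, such predicates can be written directly as boolean functions over token positions. The function names below are ours, chosen to match patterns discussed later; they are a sketch, not the paper's implementation:

```python
def preceding_token(tokens, i, j):
    """Positional pattern: x_j is the token immediately preceding x_i."""
    return j == i - 1

def matching_token(tokens, i, j):
    """Lexical pattern: two different positions holding the same token id."""
    return i != j and tokens[i] == tokens[j]

tokens = ["the", "cat", "saw", "the", "dog"]
print(preceding_token(tokens, 2, 1))  # True: "cat" precedes "saw"
print(matching_token(tokens, 0, 3))   # True: both positions hold "the"
print(matching_token(tokens, 1, 4))   # False
```

Any pattern expressible this way can later be turned into a binary matrix over all position pairs, which is exactly what the injection step in §3.2 requires.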

Step 3: Evaluate Attention Patterns
With a pattern discovered in §3.1.2, this step confirms the pattern's global relevance by empirically measuring the proportion of attention values aligning with the pattern. For each attention head, the associated predicate is evaluated over the entire validation set to ensure the pattern is not appearing by chance only in the particular examples the user happened to inspect.
Specifically, we define the global relevance (GR) of a pattern P for a head h as follows:

GR(P, h) = (1 / |X|) Σ_{x ∈ X} (1 / |x|) Σ_{(i,j) : P(x_i, x_j)} α^{(x,h)}_{i,j}    (1)

where the attention value from token x_i to token x_j on head h for an input sample x, denoted α^{(x,h)}_{i,j}, is aggregated if and only if P(x_i, x_j) holds, and normalized by the sample length |x|. To validate a pattern's generality, the relevance is averaged over the validation set X.
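The per-example part of Eq. 1 can be sketched as follows; the toy attention matrix and predicate are illustrative, and averaging the returned value over a validation set gives GR:

```python
import numpy as np

def global_relevance(attn, tokens, predicate):
    """Per-example global relevance of a pattern for one head: the
    attention mass on pairs satisfying the predicate, normalized by
    the sequence length (the inner sum of Eq. 1)."""
    n = len(tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if predicate(tokens, i, j):
                total += attn[i, j]
    return total / n

tokens = ["a", "b", "a"]
# Toy row-stochastic attention matrix (each row sums to 1).
attn = np.array([[0.10, 0.10, 0.80],
                 [0.30, 0.40, 0.30],
                 [0.90, 0.05, 0.05]])
match = lambda toks, i, j: i != j and toks[i] == toks[j]
print(round(global_relevance(attn, tokens, match), 4))  # 0.5667
```

Only the pairs (0, 2) and (2, 0) satisfy the matching-token predicate here, so the result is (0.8 + 0.9) / 3.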

Inject Patterns
As illustrated in Figure 1 (C), after extracting the patterns following the three steps in §3.1, we propose to inject the patterns into attention matrices with two methods (§3.2.1), and discuss two practical scenarios (§3.2.2) where they can be beneficial for the downstream tasks.

Methods for Injecting Patterns
In this work, we inject the discovered patterns by either fixing or masking the attention weights prior to the softmax function. For fixed attention weights, the attention logits in the scaled dot-product attention are replaced with a fixed (possibly input-dependent) matrix such that:

Attn(X) = σ(F(X)) V    (2)

where σ is the softmax operation, V is the matrix of value vectors, and F(X) ∈ {0, 1}^{n×n} computes a binary matrix from the input sequence X based on the predicate P for the specific pattern. Similarly, a pattern can also be injected by casting a mask over the attention weights computed from the key and query vectors, as:

Attn(X) = σ(Q K^T / √d_k + M(X)) V    (3)

where M(X) ∈ {0, −∞}^{n×n} computes the desired behaviour in the same fashion as F(X), and is added to the attention logits to approximate multiplying the attention distribution by a binary weight.
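A minimal numpy sketch of the two injection methods, under the assumption of toy Q/K/V matrices and a hand-built mask (in the fixed variant we scale the binary entries so the softmax rows become effectively one-hot, which matches the one-hot behaviour described in §4.3.1):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V, mask):
    """Eq. 3: a {0, -inf} mask is added to the attention logits
    before the softmax, zeroing out the masked positions."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) + mask
    return softmax(logits) @ V

def fixed_attention(F, V, scale=1e9):
    """Eq. 2: the logits are replaced by a fixed binary matrix F;
    scaling the ones makes each softmax row effectively one-hot."""
    return softmax(F * scale) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 2)) for _ in range(3))
# Mask that blocks position 0 from attending to positions 1 and 2.
mask = np.array([[0., -np.inf, -np.inf],
                 [0., 0., 0.],
                 [0., 0., 0.]])
weights = softmax(Q @ K.T / np.sqrt(2) + mask)
print(weights[0])  # all attention mass stays on position 0
```

With an all-`-inf` entry the corresponding attention weight is exactly zero, which is how masking sparsifies the attention matrix.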
Although the two methods are very similar with respect to the improvement they contribute (see §4), masking allows more flexibility and is generally used for patterns with a large number of applicable tokens, while fixing is more rigid and better suited for a small number of applicable tokens.

Scenarios for Injecting Patterns
In practice, patterns can be injected in at least two scenarios: (i) injecting patterns directly into the attention heads of transformer-based models, and (ii) injecting patterns into pretrained transformer models using techniques such as the Projected Attention Layers (Stickland and Murray, 2019).We conduct case studies for these two scenarios in §4.

Case Studies
In this section, we demonstrate the effectiveness of our pipeline in two NLP tasks (extractive summarization and topic segmentation) and discuss our findings in detail.

Models for Tasks
We adopt the popular BERTSum (Liu and Lapata, 2019) for extractive summarization. With the contextualized representation from BERT, the model uses a binary classifier to predict whether each sentence belongs in the summary. We train the model on the CNN/DM dataset (See et al., 2017), and use ROUGE (Lin, 2004) as the evaluation metric.
We adopt Cross-Segment BERT (Lukasik et al., 2020) for topic segmentation, where a candidate segment boundary is first represented by its left and right context, and then passed through a binary classifier to predict whether the candidate is a topical segment boundary.The model is trained on the WikiSection dataset (Arnold et al., 2019), and the F1-score is used as evaluation metric for validation.

Discover Patterns from Attentions
Using the two models from §4.1, since we discover that similar attention patterns exist in the important heads for both tasks, the two case studies are presented together. Without loss of generality, we will use extractive summarization as the running example task (Figure 2) to illustrate the process of pattern discovery; the same process is applied to topic segmentation. Note that (A) and (B) of Figure 2 are captured from the visual interface presented in Li et al. (2021).

Find Important Heads
We adapt the Taylor expansion method (Molchanov et al., 2019) as a proxy score for the head importance estimation.Following Li et al. (2021), we use the first-order expansion to avoid the overhead from computing the Hessian, where the gradient w.r.t. the validation loss is summed over all parameters of an attention head to estimate its importance.
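A minimal sketch of this first-order proxy, assuming we already have each head's parameter tensors and their gradients w.r.t. the validation loss (the arrays below are stand-ins, and the exact reduction follows Molchanov et al. (2019)):

```python
import numpy as np

def head_importance(params, grads):
    """First-order Taylor proxy for the importance of one attention
    head: |sum over the head's parameters of gradient * parameter|,
    approximating the loss change from removing the head."""
    return abs(sum(float((g * p).sum()) for g, p in zip(grads, params)))

# Toy stand-ins for two heads' parameter and gradient tensors.
head_a = ([np.ones((2, 2))], [np.full((2, 2), 0.5)])
head_b = ([np.ones((2, 2))], [np.full((2, 2), 0.01)])
scores = [head_importance(p, g) for p, g in (head_a, head_b)]
print(scores)  # head_a receives the larger importance score
```

In practice one backward pass over the validation set yields the gradients for every head at once, which is what makes this estimate cheap compared to pruning heads one at a time.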
The importance score heatmap of all heads is visualized in Figure 2 (A), revealing that head importance is not uniformly distributed, i.e. a small number of heads play a dominant role for the summarization task, as observed in Michel et al. (2019).

Discover and Evaluate Patterns
To discover task-specific patterns, we analyze the top-3 most important heads of each layer, and look for human-interpretable relationships encoded in the attention weights. In practice, we use the instance-level interactions provided by the visual framework (Li et al., 2021), and randomly select 5 validation examples per task for our analysis. The entire process takes less than one hour to complete for each task, where we manually examine the attention weights for less than half of the tokens of each example. It is worth noting that a detailed analysis of the trade-off between human cost and pattern recall would require extensive user studies beyond the scope of this work.
After discovering patterns, we assess the global relevance of each pattern on the validation set, where the pattern is kept only if the corresponding predicate P exists in at least one significantly relevant head. In our case studies, we use the 3-sigma rule to determine the significance of a pattern. Specifically, patterns with at least one head over 3 standard deviations above the GR mean (over all the heads) are kept for further applications.
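The 3-sigma filter can be sketched in a few lines; the toy scores below are invented (a 12-layer, 12-head BERT has 144 heads):

```python
import numpy as np

def significant_heads(gr_scores, k=3.0):
    """Indices of heads whose GR exceeds mean + k * std over all
    heads (k=3 gives the 3-sigma rule used in the case studies)."""
    s = np.asarray(gr_scores, dtype=float)
    return np.flatnonzero(s > s.mean() + k * s.std()).tolist()

# 143 heads with low relevance plus one standout head.
scores = [0.05] * 143 + [0.9]
print(significant_heads(scores))  # [143]
```

A pattern survives this step if the returned list is non-empty, i.e. at least one head aligns with it far more strongly than the population of heads as a whole.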
After verifying on the validation set, we discover three patterns that consistently exist in both tasks (over 50% of important heads). This suggests that important patterns are generalizable across multiple NLP tasks, which is consistent with the findings of Bian et al. (2021). Further analysis also shows that the attention patterns are consistent after fine-tuning, where we report an average Jensen-Shannon Divergence of 0.01 between the attention distributions of BERTSum across 3 random seeds. We hope our findings provide motivation for the in-depth study of pattern importance in different NLP tasks. Lastly, while it may be argued that this step of the pipeline could be automated by directly evaluating the importance and relevance of predefined patterns (e.g., syntax, discourse) based on intuitions, as indicated below, our interactive approach allows the discovery of interpretable patterns which would otherwise be hard to define due to the infinite search space of possible patterns. Next, we describe the three discovered patterns in detail.
Matching Token (Green in Figure 2) This pattern describes the "attending to matching tokens" behaviour, where the attention value α^h_{i,j} between input tokens x_i and x_j on head h is high whenever x_i = x_j. For example, as shown in Figure 2 (i), the token "photo" mostly attends to other appearances of the token "photo" in the input sequence. To evaluate whether this pattern has a large global relevance for any head, we only consider tokens that appear at least twice within a single document, and compute GR (Eq. 1), in which P(x_i, x_j) holds if and only if x_i = x_j.
The evaluation results show that there are several heads for which the matching token pattern has high global relevance (see the green box in Figure 2). Interestingly, these heads are more prominent (in the importance heatmap) for the extractive summarization task, suggesting this pattern is especially important for summarization models during inference.
Intra-Sentence/Context (Olive in Figure 2) This pattern describes the behaviour of only attending to tokens within a text span. For summarization, these heads focus on attending to tokens within the same sentence (Figure 2 (ii)). Similarly, the same heads in topic segmentation models focus on attending to tokens within the same context (left or right). To evaluate this pattern, GR is computed with P(x_i, x_j) holding iff x_i and x_j occur within the same text span. Figure 2 (C) reveals that this pattern appears more frequently in the mid to upper layers of the transformer encoder.
Positional (Blue in Figure 2) Similar to Kovaleva et al. (2019), we also observe "positional heads", which focus specifically on either the preceding or following tokens, i.e., either α^h_{i,i−1} or α^h_{i,i+1} has high values (Figure 2 (iii)). To evaluate this pattern, GR is computed with P(x_i, x_j) holding iff j = i − 1 for preceding positional heads and j = i + 1 for succeeding positional heads. The pattern is verified to exist in the lower layers of the encoder, as shown in the blue boxes of Figure 2 (C).
Other Patterns In addition to the three patterns mentioned above, we also observe heads that focus on attending to special tokens (e.g., [CLS], [SEP]) or punctuation (e.g., periods). However, we find that attention heads with this behaviour are generally less important for the task (outside the top-3), and therefore omit them from the next step of our pipeline.
On the other hand, we also find uninterpretable attention patterns in some of the important heads of each layer. As hypothesized by previous works (Clark et al., 2019), these attention heads might be performing complex linguistic operations in combination with other heads. We leave the verification, interpretation, and efficient injection of these patterns into models as a direction for future work.

Injecting Patterns to Models
After uncovering potentially important patterns and confirming their relevance, we inject them into transformer-based models for the tasks of summarization and topic segmentation by masking and fixing the attention weights. While we only perform the pattern discovery process on the CNN/DM and WikiSection datasets, we inject the discovered patterns into two other datasets (NYT-50 (Sandhaus, 2008) for summarization and Wiki-727K (Arnold et al., 2019) for topic segmentation) to demonstrate that our discovered patterns are generalizable in "cross-dataset" settings.

Method for Fixing and Masking
The patterns identified in our analysis can be injected into an attention head by masking or fixing its corresponding attention weight matrix. Specifically, for the matching token pattern, we apply an attention mask which enforces that when a token appears more than once in the document, it should attend only to other occurrences of itself:

M(X)_{i,j} = 0 if x_i = x_j, and −∞ otherwise,

where the constraint is removed for tokens occurring only once in the document.
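A minimal sketch of building the matching-token mask, assuming a tokenized input and a large negative constant standing in for −∞:

```python
import numpy as np
from collections import Counter

NEG_INF = -1e9  # stands in for -inf in the attention logits

def matching_token_mask(tokens):
    """Mask for the matching-token pattern: a token that appears more
    than once may attend only to positions holding the same token;
    tokens occurring once are left unconstrained."""
    n = len(tokens)
    counts = Counter(tokens)
    mask = np.zeros((n, n))
    for i in range(n):
        if counts[tokens[i]] < 2:
            continue  # constraint removed for singleton tokens
        for j in range(n):
            if tokens[j] != tokens[i]:
                mask[i, j] = NEG_INF
    return mask

m = matching_token_mask(["a", "b", "a", "c"])
print(m[0])  # row for the first "a": only "a" positions stay open
```

Adding this matrix to the attention logits (Eq. 3) forces repeated tokens to distribute their attention over their own occurrences only.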
Similarly, for intra-sentence/intra-context attention, the attention mask specifies that only tokens within the same boundary can attend to each other:

M(X)_{i,j} = 0 if x_i and x_j occur within the same text span, and −∞ otherwise.

Lastly, we use a fixed attention matrix to encode the two positional patterns, with:

F(X)_{i,j} = 1 if j = i − 1, and 0 otherwise

for the preceding-token pattern, and the succeeding-token matrix being the same but equal to 1 for j = i + 1. We use fixed attention matrices for these patterns to save computational overhead, since this has the same effect as applying the mask (each row is a one-hot vector). This is similar to the method proposed by Raganato et al. (2020), but we only fix the preceding- and succeeding-token patterns.
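The other two constructions can be sketched in the same style; the span-id representation and the handling of boundary rows are our own assumptions for illustration:

```python
import numpy as np

NEG_INF = -1e9  # stands in for -inf

def intra_span_mask(span_ids):
    """Mask for the intra-sentence/intra-context pattern: attention is
    allowed only between positions sharing the same span id."""
    ids = np.asarray(span_ids)
    allowed = ids[:, None] == ids[None, :]
    return np.where(allowed, 0.0, NEG_INF)

def positional_fixed(n, offset):
    """Fixed matrix for the positional patterns: F[i, j] = 1 iff
    j = i + offset (offset=-1 preceding, offset=+1 succeeding).
    Boundary rows without a valid target are left all-zero."""
    F = np.zeros((n, n))
    for i in range(n):
        j = i + offset
        if 0 <= j < n:
            F[i, j] = 1.0
    return F

print(intra_span_mask([0, 0, 1, 1]))
print(positional_fixed(4, -1))
```

Since each non-boundary row of the positional matrix is one-hot, replacing the logits with it reproduces the masked behaviour without computing Q and K at all, which is the computational saving mentioned above.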

Pattern-Infused Sparse Transformers
In the first round of experiments, we inject the four patterns into smaller transformer models to demonstrate their effectiveness on both tasks. Since the goal of these experiments is to assess the benefits brought by the patterns, we do not perform an extensive hyper-parameter search when injecting them (e.g., on which layers, etc.).
Under both settings, each of the four patterns (including the two positional patterns) is injected into a separate attention head across all layers of the model. Motivated by studies on the trade-off between sparsity ratio and task performance, we adopt the sparsity ratio used by previous works (Shi et al., 2021; Wang et al., 2022):

ρ = 1 − |M| / N²

where |M| denotes the number of non-zero elements in the attention mask, and N denotes the length of the example. Given the sparsity ρ, the complexity of self-attention is thus reduced to O((1 − ρ)N²) (Shi et al., 2021). To investigate how the sparsity ratio affects the performance of our pattern-infused models, we experiment with different numbers of heads to inject our patterns into, where the sparsity ratio increases along with the number of heads (with patterns).
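Under the assumption that |M| is counted on a binary keep-mask (1 = the position may attend), the ratio can be sketched as:

```python
import numpy as np

def sparsity_ratio(keep_mask):
    """rho = 1 - |M| / N^2, with |M| the number of positions allowed
    to attend (non-zeros of a binary N x N keep-mask)."""
    N = keep_mask.shape[0]
    return 1.0 - np.count_nonzero(keep_mask) / (N * N)

# A diagonal-only pattern on a length-10 example keeps 10 of 100 cells.
print(sparsity_ratio(np.eye(10)))  # 0.9
```

A dense (unmasked) head has ρ = 0, and ρ grows toward 1 as more heads are constrained to sparse patterns.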
As shown in Table 1, our pattern-infused models outperform the plain transformer models on both the CNN/DM and NYT-50 datasets under all three settings (6 Layers 8 Heads, 6 Layers 12 Heads, and 6 Layers 12 Heads with BERT embeddings). Similarly for topic segmentation, results also show that the pattern-injection approach substantially outperforms the vanilla transformer across all metrics. It is worth emphasizing that the performance gain is slightly higher for the summarization models. When normalized by the ROUGE scores of extractive oracle summaries, the pattern-infused summarization models achieve an average 15% improvement over the baselines, while the topic segmentation models achieve a 12% improvement over the baselines. In line with prior work (McCoy et al., 2020), we also find that the performance is consistent across random seeds, where we report an extremely low standard deviation of 0.03 (ROUGE) and 0.002 (F1) for extractive summarization and topic segmentation, respectively. Overall, the results from our experiments convincingly demonstrate the benefits of our approach and the generalizability of the patterns discovered by our pipeline. In addition, while a higher sparsity ratio causes a slight decrease in performance under some scenarios, we find that even with a ratio of 0.86, our model still significantly outperforms the vanilla transformer across all settings. This is in contrast to the findings of previous work (Child et al., 2019; Guo et al., 2019; Li et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Shi et al., 2021), where the high sparsity ratio from fixed patterns often results in performance degradation relative to the vanilla transformer. These findings provide crucial insights for designing more energy-efficient models in the future.
Overall, with the discovered patterns injected, our models are arguably more interpretable than plain transformers on both tasks, as we know with certainty the information encoded in each masked/fixed attention head. To further justify our claim of interpretability, the attention heads with patterns injected tend to have higher importance scores than the other heads (an illustrative example is shown in Appendix C.1), suggesting that such patterns are effectively leveraged by the model.
To study the contribution of individual patterns, we perform an ablation study by injecting all combinations of patterns on CNN/DM using the transformer model with 6 layers and 8 heads. From Table 2, we observe that injecting matching token and intra-sentence together achieves the strongest improvement in accuracy among all combinations, only slightly lower than injecting all patterns. Meanwhile, the gains from injecting patterns separately are only marginal. One intriguing explanation is that these two patterns allow the model to learn sentence-level features based on term frequency (plausibly similar to TF-IDF (Jones, 1972)), where higher scores are assigned to sentences containing frequently appearing tokens. Additionally, although injecting only the positional patterns causes the performance to degrade, they work better when combined with the two other patterns. We hypothesize that positional patterns need to be combined with patterns carrying more global context in order to be effectively utilized.

Guided Pattern Injection into Pre-trained Models

We then experiment with injecting the patterns back into the pre-trained transformer encoder. In particular, we inject them through additional attention heads in the form of a Projected Attention Layer (PAL) (Stickland and Murray, 2019), along with the parameters of the original model. Details regarding PALs are described in Appendix A.

                    CNN/DM                  NYT-50
Model               R-1    R-2    R-L      R-1    R-2    R-L
BERTSum             42.33  19.88  38.86    48.37  29.25  40.72
+ PAL               42.34  19.88  38.86    48.56  29.41  40.91
+ PAL + Patterns    42.58  20.05  39.10    48.74  29.60  41.11

Table 3: ROUGE F-scores of PAL with pretrained models for extractive summarization. All metrics were significantly better than the baselines with a confidence level of 99% according to the Bootstrap Significance test (Dror et al., 2018).
The hidden size of our PALs is 256, consisting of 4 additional attention heads (d_k = d_v = d_q = 64). A PAL is added in each of the 12 BERT layers, and our patterns are injected into the 4 PAL attention heads. To ensure the changes in performance are due to the patterns rather than the additional parameters, we also compare against adding PALs without injecting the patterns.
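Under the stated configuration, the shape of such a layer can be sketched in numpy; the random weights and the simplified single-layer structure are stand-ins, not the actual PAL implementation (see Stickland and Murray (2019) for the real method):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_pal, n_heads, seq = 768, 256, 4, 10
d_head = d_pal // n_heads  # 64, matching d_k = d_v = d_q

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-in weights: down/up projections and per-head Q/K/V maps.
V_d = rng.normal(scale=0.02, size=(d_model, d_pal))  # down-projection
V_u = rng.normal(scale=0.02, size=(d_pal, d_model))  # up-projection
W_qkv = rng.normal(scale=0.02, size=(3, n_heads, d_head, d_head))

def pal(h, pattern_masks):
    """One PAL sketch: project down to 256 dims, run 4 small attention
    heads (each with an injected pattern mask added to its logits),
    project back up, and add residually to the layer input h."""
    x = h @ V_d
    heads = []
    for k in range(n_heads):
        xs = x[:, k * d_head:(k + 1) * d_head]
        q, key, v = (xs @ W_qkv[m, k] for m in range(3))
        logits = q @ key.T / np.sqrt(d_head) + pattern_masks[k]
        heads.append(softmax(logits) @ v)
    return h + np.concatenate(heads, axis=-1) @ V_u

h = rng.normal(size=(seq, d_model))
masks = [np.zeros((seq, seq))] * n_heads  # pattern masks would go here
out = pal(h, masks)
print(out.shape)  # (10, 768)
```

The residual form means the PAL can only add information on top of the frozen-shape BERT representation, which is why comparing against pattern-free PALs isolates the contribution of the patterns.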
Results in Table 3 indicate that injecting the patterns in PAL (+PAL+Patterns) surprisingly improves BERTSum's performance on both datasets, where the performance gains on NYT-50 are similar to (or even slightly better than) those on the in-domain CNN/DM dataset, supporting the generality of the discovered patterns. Additionally, as was the case for the transformers with patterns injected, visualizing the head importance scores reveals that the PAL heads with patterns injected are significantly more important (by two orders of magnitude) than the PAL heads without patterns injected (see Appendix C.2), indicating that the interpretable patterns are important features during model inference.
In summary, the key aim of our experiments was to verify consistent improvements over our own baselines under the same settings in order to probe the benefits (effectiveness and efficiency) of the discovered patterns for the task.Therefore, we do not perform extensive tuning to achieve the same results reported by Liu and Lapata (2019).

Conclusion and Future Work
In this paper, we propose a generic human-in-the-loop pipeline which combines two popular research directions, where the findings from an analysis of the multi-head self-attention mechanism in transformers can be utilized to create more accurate and interpretable transformer models. A human analyzes the attention heads of a task-specific model, discovers and verifies potentially meaningful patterns, and injects them into the attention heads of models. By running case studies on two NLP tasks, we show the effectiveness of our pipeline. We do discover meaningful patterns in some important heads, and the relationships encoded in the patterns help us understand the features used by the model for both tasks. Furthermore, by injecting the patterns into both smaller models and the original model, performance and interpretability improve in both cases.
As future work, we plan to apply our pipeline to other NLP tasks (e.g., language modeling, abstractive summarization) and explore whether the important patterns from one task transfer to another. Similarly, we also plan to apply our pipeline to different model variants to examine and compare the patterns encoded in the attention weights. In the long term, our pipeline could be naturally automated by replacing the pattern discovery step with the evaluation of predefined linguistic patterns. However, assessing the efficiency gains from injecting such patterns (which requires ground-truth annotations) would require more in-depth studies beyond the scope of this paper. Finally, since human factors are an important aspect of interpretability, we plan to conduct extensive user studies across different NLP tasks and model sizes to examine the trade-off between human cost and the coverage of discovered patterns.

Limitations
The scope of our case studies is limited to English datasets consisting of long documents and to BERT-based models. Additionally, we only adopt the visual interface proposed by Li et al. (2021) due to its support for long documents, and leave the design and implementation of additional visualization techniques as an avenue for future work.

C.2 Projected Attention Layer Heads
We visualize the importance scores of the PAL heads for BERTSum trained on CNN/DM (Figure 4), where four heads are added to each BERT layer via a residual connection. Figure 4b shows the normalized importance scores of the PAL heads without any patterns injected; the model opts to use almost entirely the representation from the BERT layers. In Figure 4a, where each of the four PAL heads is injected with one of our patterns, the importance scores increase significantly over the scores without injection, indicating that the features encoded in our patterns are indeed being utilized by the model in addition to the existing pretrained representations.
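As a minimal sketch of how such a normalized importance score could be computed, consider normalizing scalar contributions (e.g., mean output-vector norms) of the BERT branch and each PAL head so they sum to 1. The function name and the use of raw norm contributions are our assumptions for illustration, not the actual BERTSum/PAL implementation:

```python
import numpy as np

def normalized_importance(bert_contrib, pal_contribs):
    """Normalize the contribution of the pretrained BERT branch and each
    PAL head in a residual layer so the scores sum to 1.

    bert_contrib: scalar contribution of the BERT layer (e.g., mean norm).
    pal_contribs: list of scalar contributions, one per PAL head.
    """
    contributions = np.array([bert_contrib] + list(pal_contribs), dtype=float)
    return contributions / contributions.sum()

# Hypothetical contributions: without injected patterns, the BERT branch
# dominates and each PAL head receives only a small share.
scores = normalized_importance(9.0, [0.25, 0.25, 0.25, 0.25])
```

Under these made-up numbers, the BERT branch accounts for 90% of the total, mirroring the qualitative behavior described for Figure 4b.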

D.1 Trigram Blocking
In our experiments, we follow previous work (Paulus et al., 2018; Liu and Lapata, 2019) in evaluating the models in two ways: with and without trigram blocking. At inference time, the summary is usually formed by selecting the sentences with the highest prediction scores. With the trigram blocking trick, however, a sentence is not selected if it shares a trigram with an already selected sentence. This trick has been shown to be an effective way to reduce redundancy on some datasets (e.g., CNN/DM), but may cause a performance drop on others (e.g., PubMed and arXiv).
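The selection procedure described above can be sketched as follows (function names are ours; this is a generic illustration of trigram blocking, not the exact implementation used in our experiments):

```python
def trigrams(sentence):
    """Return the set of word trigrams in a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(sentences, scores, max_sents=3):
    """Greedily pick the top-scoring sentences, skipping any sentence
    that shares a trigram with an already selected one."""
    selected, seen = [], set()
    # Visit sentence indices from highest to lowest prediction score.
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tri = trigrams(sentences[idx])
        if tri & seen:      # overlapping trigram -> sentence is blocked
            continue
        selected.append(idx)
        seen |= tri
        if len(selected) == max_sents:
            break
    return sorted(selected)  # restore document order
```

For example, if the two highest-scoring sentences share a trigram, the second is skipped and the next non-overlapping sentence is selected instead.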

D.2 Pattern-Infused Sparse Transformers
In Table 4, we show the trigram blocking results of the sparse transformer models on both summarization datasets, and Table 5 shows the trigram blocking results for the pattern ablation experiment on the CNN/DM dataset. In line with §4.3.2, our pattern-infused models outperform all the other models in all settings on both datasets.
As for the ablation study, we see a higher performance gain with the matching-token pattern when trigram blocking is applied, while the best performing model is still the one with all patterns applied. Table 6 shows the results with trigram blocking. We find the performance gain from the patterns to be higher for CNN/DM and lower for NYT-50.

E Pattern Ablation for Topic Segmentation
Table 7 shows that applying all three types of patterns leads to the highest performance gain in F-1 score. This is in line with the ablation results for extractive summarization.

Figure 1: Overview of our proposed generic pipeline. Given (A) a trained model for a specific task, our pipeline can be divided into two main parts: (B) pattern discovery and (C) pattern injection.

Figure 2: Example of pattern extraction in the extractive summarization case study. (A) We first find important heads, before (B) identifying the three interpretable patterns (highlighted in green, olive, and blue, respectively): (i) matching-token, (ii) intra-sentence, and (iii) positional. Finally, (C) each pattern is evaluated with the global relevance score (GR) on all of the attention heads. For the purpose of illustration, we display one attention head with a significantly larger GR for each of the three identified patterns.

3 Results shown in Sec. 4 are without the trigram blocking trick; more results with it are provided in Appendix D.

Table 1 :
Results for the two tasks (four datasets) under different settings, where we report the average performance across the top-3 checkpoints. The parentheses (e.g., 4/8) denote the number of heads with patterns injected, while sparsity (ρ) is computed as the average over the 4 datasets.

Table 2 :
Ablation study on the CNN/DM dataset with the 6-layer 8-head transformer setting.

Table 4 :
Results for the summarization experiments under three settings with trigram blocking applied.

Table 6 :
ROUGE F-scores of pretrained models with PAL when trigram blocking is applied.

Table 7 :
Ablation study results on the WikiSection dataset with the 6-layer 8-head setting.