Enlivening Redundant Heads in Multi-head Self-attention for Machine Translation

Multi-head self-attention has recently attracted enormous interest owing to its specialized head functions, highly parallelizable computation, and flexible extensibility. However, recent empirical studies show that some self-attention heads contribute little and can be pruned as redundant. This work takes the novel perspective of identifying and then vitalizing redundant heads. We propose a redundant head enlivening (RHE) method to precisely identify redundant heads and then realize their potential by learning syntactic relations and prior knowledge in the text, without sacrificing the roles of important heads. Two novel syntax-enhanced attention (SEA) mechanisms, a dependency mask bias and a relative local-phrasal position bias, are introduced to revise self-attention distributions for syntactic enhancement in machine translation. The importance of individual heads is dynamically evaluated during redundant head identification, after which we apply SEA to vitalize redundant heads while maintaining the strength of important heads. Experimental results on the widely adopted WMT14 and WMT16 English→German and WMT16 English→Czech machine translation tasks validate the effectiveness of RHE.


Introduction
Recently, the self-attention network (SAN) (Lin et al., 2017) has been applied to various natural language processing tasks. Instead of drawing distance-aware dependencies like recurrent neural networks (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (Kim, 2014), SAN captures short- and long-range relations between elements. SAN involves all signals with a weighted averaging operation, which may incorporate too many unrelated elements to concentrate on specific relations. Recent work has modified SAN to enhance the learning of specific relations. For example, the directional self-attention network (DiSAN) (Shen et al., 2018) uses one or more positional masks to model the asymmetric attention between two elements and capture context-aware relations for all tokens. Other work modeled local information by revising the attention distribution with a learnable Gaussian bias so as to focus on neighboring relations. Shaw et al. (2018) extended SAN to efficiently consider distinct representations of the relative linear positions between sequence elements. However, the above approaches treat the multi-head SAN as a whole and ignore the unbalanced distribution of contributions across heads.
Furthermore, multi-head SAN combines attentions from multiple subspaces to construct the Transformer (Vaswani et al., 2017), achieving state-of-the-art results in recent neural machine translation (NMT) tasks (Hassan et al., 2018). Very recent work (Voita et al., 2019) shows that individual encoder-side heads in the Transformer make different contributions: heads can be classified into important heads and redundant heads, and pruning redundant heads does not seriously affect performance. They also find that important heads play various roles which influence the generated translations to different extents, including a syntactic function (focusing on dependency relations), a positional function (focusing on neighboring words), and a rare-words function.
To date, our understanding of the roles of distinct heads is very limited, with no systematic analysis available of the roles of different heads. In this paper, we precisely identify redundant heads on the encoder side of the Transformer and demonstrate the potential of syntactically reactivating them to improve multi-head SAN performance. Fig. 1 illustrates the rationales of existing work in multi-head SAN against ours. The left part represents approaches that directly enhance all heads as a whole w.r.t. their designed functions without differentiating their roles. Such approaches may downplay the functions of important heads and the diversity of the multi-head mechanism. The middle part represents methods that analyze the contributions and functions of multi-head SAN and then prune the determined redundant heads, relying on the important heads only. As shown in the right part, this paper proposes a dynamic and unified strategy to identify redundant heads and then enliven them to fulfill their potential. By enlivening the redundant heads, our approach enhances their performance without sacrificing the essential functions of important heads. In addition, our method further increases the number of important heads.
Specifically, we take NMT as an example to illustrate our method of identifying and reactivating redundant heads in multi-head SAN. We first propose two novel Syntax-Enhanced Attention (SEA) mechanisms for machine translation: 1) Dependency-Enhanced Attention, which uses a dependency matrix as a mask to intensify the attention between dependent elements and filter out elements without direct dependency relations; and 2) Local-phrase-Enhanced Attention, which incorporates a distinct and learnable relative local-phrasal position matrix as a bias, transformed from a constituency tree under the rules of local-phrases. These syntax-enhanced attention mechanisms simulate the specific functions of important heads but differ from existing self-attention improvement approaches. Compared to the dependency tree, the constituency tree provides distinct syntactic layer information for each word, which we extract to calculate relative phrasal positions reflecting the syntactic relations between elements. To this end, we define a novel phrase type, the local-phrase, which extracts only syntactically related words as a phrase by leveraging the constituency tree, regardless of sequence distance. Further, we propose a dynamic and lightweight Redundant Heads Enlivening (RHE) strategy for multi-head SAN to reactivate and enhance the roles of redundant heads. Lastly, a dynamic function gate is designed, transformed from the average of maximum attention weights, to compare with syntactic attention weights and identify redundant heads that do not capture meaningful syntactic relations in the sequence.
We test the above design on three widely used translation tasks: WMT14 and WMT16 English→German and WMT16 English→Czech. Extensive analyses reveal that enlivening redundant heads in multi-head SAN beats improving all heads, and that the proposed syntax-enhanced attention mechanisms with dependencies and local-phrases further improve translation performance.

Related Work
One popular extension to SAN is to revise the attention distribution with static and dynamic biases. Different types of biases have been considered, including directional relations (Shen et al., 2018) and localness (Sperber et al., 2018; Zhang et al., 2018a). Shen et al. (2018) improve SAN with directional masks and multi-dimensional features by explicitly revising the attention distribution. In this paper, we focus on explicit syntactic biases by proposing dependency-enhanced attention and local-phrase-enhanced attention. Several papers show that explicitly modeling dependencies (Bastings et al., 2017; Nadejde et al., 2017) or phrases (Wang et al., 2017; Huang et al., 2018; Zhang et al., 2018b, 2020) is useful for tasks such as NMT. Related to our work, Strubell et al. (2018) and Hao et al. (2019) also modify parts of the self-attention heads with syntactic information. However, they assign heads randomly instead of analyzing the importance and function of each head in advance. Sperber et al. (2018) restrict SAN to neighboring elements and perform better for longer sequences in acoustic modeling and natural language inference tasks. Another approach leverages a Gaussian bias predicted by the query vector to dynamically model localness for SAN.
Other work analyzes the attention weights of different NMT models (Ghader and Monz, 2017; Voita et al., 2018; Tang et al., 2018; Raganato and Tiedemann, 2018). Voita et al. (2019) consider how different heads correspond to specific relations and show that redundant heads can be pruned without greatly decreasing translation performance. However, they disregard the full potential of redundant heads, which our SEA exploits. Li et al. (2018) recognize the diversity of multiple attention heads and introduce a disagreement regularization to explicitly encourage this diversity. Nevertheless, they do not recognize that only some individual heads are redundant, which is a prerequisite for optimizing multi-head diversity.
In summary, while some of the related work revises the attention distribution with a bias, our work is the first to propose a complete and precise strategy to analyze individual heads, identify redundant heads, and then enliven them with a syntactic bias.

Multi-head Self-attention
Multi-head SAN (Vaswani et al., 2017; Shaw et al., 2018; Shen et al., 2018) projects the input sequence into multiple subspaces (h attention heads), applies scaled dot-product attention to the hidden states in each head, and then concatenates the outputs. For each self-attention head head_i (1 ≤ i ≤ h) in the multi-head SAN for NMT, given an input sequence x = {x_1, ..., x_n}, each hidden state in the l-th layer is constructed by attending to the states in the (l−1)-th layer. Specifically, the hidden states of the (l−1)-th layer H^{l−1} ∈ R^{n×d_h} are first transformed into the queries Q ∈ R^{n×d_h}, the keys K ∈ R^{n×d_h}, and the values V ∈ R^{n×d_h} with three separate weight matrices, where d_h represents the dimensionality of each head.
The hidden state H_i of the l-th layer is calculated as:

H_i = Att(Q_i, K_i) V_i,  (1)

where Att(·) is a scaled dot-product attention model, defined as:

Att(Q, K) = softmax(QK^T / √d_k),  (2)

where √d_k is the scaling factor, with d_k being the dimensionality of the layer states.
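As a minimal NumPy sketch of Eqs. (1)-(2), the following implements scaled dot-product attention and the per-head projection-and-concatenation of multi-head SAN. The function names and the simplified single-matrix projections are ours, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """Eqs. (1)-(2): softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(logits) @ V

def multi_head_self_attention(H, W_q, W_k, W_v, n_heads):
    """Project H into n_heads subspaces, attend per head, concatenate."""
    n, d = H.shape
    d_h = d // n_heads  # per-head dimensionality
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_h, (i + 1) * d_h)
        heads.append(scaled_dot_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1)
```

Each head's attention matrix is row-stochastic, so the rows of softmax(·) sum to one; the concatenated output has the same shape as the input hidden states.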

Multi-head Analysis
In (Voita et al., 2019), a "confidence" scalar h_conf is calculated as the average of the maximum attention weights over all n source tokens in one head:

h_conf = (1/n) Σ_{i=1}^{n} max_j Att(Q_i, K_j),  (3)

where max_j Att(Q_i, K_j) represents the maximum attention weight assigned by x_i among all source tokens x_j in the sequence. Further, a fixed gate value f_gate (0 < f_gate ≤ 1) is given that judges a head as important if h_conf > f_gate for all training examples and epochs. In addition, three head functions are identified according to the frequency with which the maximum attention weight is assigned to a specific position: the syntactic function, the positional function, and the rare-words function.
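The confidence score of Eq. (3) and the fixed-gate importance test reduce to a few lines. A minimal sketch (function names are ours), taking a head's n×n attention matrix:

```python
import numpy as np

def head_confidence(attn):
    """Eq. (3): average over the n query positions of the maximum
    attention weight each token assigns to any source token."""
    # attn: (n, n) attention matrix of one head, rows sum to 1
    return attn.max(axis=-1).mean()

def is_important(attn, f_gate):
    # Fixed-gate criterion of Voita et al. (2019):
    # a head is important if h_conf > f_gate.
    return head_confidence(attn) > f_gate
```

For instance, a head that attends sharply (an identity-like attention matrix) has confidence near 1, while a head that attends uniformly over n tokens has confidence 1/n and is judged redundant for any reasonable gate.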

The RHE Design
Fig. 2 shows the architecture of our proposed redundant heads enlivening (RHE) approach, which identifies redundant heads and then enlivens them by revising the self-attention distributions with a syntactic bias. RHE takes full advantage of the multi-head SAN by capturing both dependency and distinct phrasal relations. First, two Syntax-Enhanced Attention (SEA) mechanisms, Dependency-Enhanced Attention (DEA) and Local-phrase-Enhanced Attention (LPEA), are proposed. DEA disables the attention between elements without dependencies by leveraging the dependency mask, while LPEA precisely regulates the self-attention distribution with a distinct and learnable local-phrase bias that represents relative local-phrasal positions transformed from the constituency tree; LPEA thereby captures both short- and long-term syntactic relations. Second, the Redundant Head Identification module dynamically determines the importance and function of each head during training according to the average of the summed syntactic attention weights. Lastly, the self-attention of the redundant heads is replaced by SEA to enliven their full potential and roles.

DEA: Dependency-Enhanced Attention
DEA is a syntactic extension of standard self-attention that focuses on the internal dependencies between elements. We add a dependency mask bias d to the logit similarity in Eq. (2):

Att(Q, K) = softmax(QK^T / √d_k + d).  (4)

Given a dependency mask D ∈ {0, −∞}^{n×n}, we set the bias d to D in Eq. (4). Note that, due to the exponential operation in the softmax function, adding a bias d ∈ {0, −∞}^{n×n} to the alignment scores is equivalent to multiplying the attention distribution elementwise by a weight of 1 (for d = 0) or 0 (for d = −∞).

To encode the dependency information into this mask, we define the value of D_{i,j} according to the head-dependent relation Dep(x_i, x_j) between elements x_i and x_j:

D_{i,j} = 0 if Dep(x_i, x_j) holds, and D_{i,j} = −∞ otherwise.  (5)

Eq. (5) shows that we ignore the relations between independent word pairs (x_i, x_j) by setting D_{i,j} = −∞; meanwhile, the attention weights concentrate more on dependent word pairs. Assuming each dependency relation to be equally important, we do not assign different biases to different dependent word pairs, setting D_{i,j} = 0 for all of them. This enhances the ability of self-attention to capture dependency relations.
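The mask of Eq. (5) and the biased attention of Eq. (4) can be sketched as follows. This is our own minimal NumPy illustration: we use a large negative constant in place of −∞, assume a set of (i, j) dependency pairs as input, and let every token attend to itself so no softmax row is all-masked:

```python
import numpy as np

NEG_INF = -1e9  # stand-in for -inf; exp(NEG_INF) underflows to 0

def dependency_mask(n, dep_pairs):
    """Eq. (5): D[i, j] = 0 for head-dependent pairs, -inf otherwise.
    dep_pairs is a set of (i, j) index pairs with a direct dependency;
    the mask is applied symmetrically and each token keeps self-attention."""
    D = np.full((n, n), NEG_INF)
    for i in range(n):
        D[i, i] = 0.0
    for i, j in dep_pairs:
        D[i, j] = 0.0
        D[j, i] = 0.0
    return D

def dea_attention(Q, K, D):
    """Eq. (4): softmax(QK^T / sqrt(d_k) + D)."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + D
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

After the softmax, pairs without a dependency receive (numerically) zero attention weight, so the distribution concentrates entirely on dependent word pairs.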

LPEA: Local-phrase-Enhanced Attention
LPEA includes a distinct and learnable syntactic bias to revise the attention weights. A local-phrase bias p represents the relative phrasal position between x_i and x_j (x_j ∈ local_phrase(x_i)); meanwhile, it masks the attention between words not in local_phrase(x_i). Similar to DEA, we modify Eq. (2) as:

Att(Q, K) = softmax(QK^T / √d_k + p).  (6)

We further introduce the concept of the local-phrase, obtained from the constituency tree in terms of two rules, different from general phrases, which mostly consist of neighboring words. A local-phrase contains syntactically related words regardless of sequence distance; hence a local-phrase carries the distinct and hierarchical syntactic relations between elements.
• Rule 1: Given a constituency tree with m layers, the word x_i, and its ancestor node sequence ast = (ast_{layer(x_i)−1}, ..., ast_0), we assume that local_phrase(x_i) contains the words that belong to the lowest multi-descendant ancestor of x_i in ast.

To obtain the local-phrase bias p, we first extract a relative phrasal position matrix RP from the constituency tree. As Fig. 3 shows, given a matrix RP ∈ R^{n×n}, each element represents the relative syntactic distance between words x_i and x_j. For words x_i and x_j not in the same local-phrase (e.g., "Sharon" and "talk"), we set the relative position to ∞ (3rd row, 6th column). For words in the same local-phrase, such as "held" and "talk", we calculate the relative phrasal position according to their relative phrase layers (Layer 3 − Layer 4 = −1) and set RP_{2,4} = −1. Accordingly, we obtain the matrix RP.
Since the RP matrix cannot be directly encoded in the attention distribution, inspired by (Shaw et al., 2018), we use a group of vectors to represent the relative phrasal positions between words in RP.
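Following Shaw et al. (2018), such relative positions are typically clipped to a window [−k, k] before being mapped to one of 2k + 1 learnable edge-label vectors. A minimal sketch of this clipping and embedding lookup (function names ours; pairs outside a shared local-phrase, marked ∞ in RP, receive a zero vector here since they are masked by the bias):

```python
import numpy as np

def clip_position(x, k):
    """Clip a relative phrasal position to [-k, k], as in Shaw et al. (2018)."""
    return max(-k, min(k, x))

def lookup_relative_vectors(RP, w, k):
    """Map the integer matrix RP (np.inf marks word pairs outside a shared
    local-phrase) to vectors M[i, j] = w[clip(RP[i, j], k) + k], where w
    holds the 2k + 1 learnable edge-label embeddings of size d_h."""
    n = RP.shape[0]
    d_h = w.shape[1]
    M = np.zeros((n, n, d_h))
    for i in range(n):
        for j in range(n):
            if np.isfinite(RP[i, j]):
                M[i, j] = w[clip_position(int(RP[i, j]), k) + k]
    return M
```

The offset `+ k` shifts positions from [−k, k] into valid row indices [0, 2k] of the embedding table.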
Considering that precise relative phrasal position information beyond a certain distance is not useful, the relative phrasal position is clipped to a maximum absolute value k. We therefore consider 2k + 1 unique edge labels for the relative phrasal position vectors and transform the integer matrix RP into the corresponding vector matrix M ∈ R^{n×n×d_h}, where:

M_{ij} = w_{clip(RP_{ij}, k)}, clip(x, k) = max(−k, min(k, x)),  (7)

and the 2k + 1 edge-label vectors w_{−k}, ..., w_{k} are the learned relative phrasal position representations. After obtaining the matrix M, we apply a feed-forward network to transform the relative local-phrasal position vector M_{ij} into a relative local-phrasal position hidden state, which is further mapped to a negative scalar P_{ij} of the local-phrase bias matrix p by a linear projection U_P ∈ R^{d_h×1}:

P_{ij} = FFN(M_{ij}) U_P.  (8)

Fig. 3 shows the process of extracting the relative local-phrase bias p from the constituency tree.

Incorporating SEA into Multi-head Self-attention

Redundant Head Identification
We enhance the syntactic function of the self-attention heads by dynamically identifying the redundant heads that lack the ability to capture short- and long-term syntactic relations, and then enhancing these heads with SEA. We first apply the dependency mask Dep_mask to the attention weight matrix to obtain the corresponding syntactic attention weights, which reflect short- and long-term syntactic relations. Then, we sum the syntactic attention weights for each x_i over all syntax-related source tokens x_j in the sequence. Finally, we calculate the average syntactic attention weight scalar Syn_attn as follows:

Syn_attn = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Dep_mask(Att(Q_i, K_j)).  (9)

We propose a function gating criterion: when the average syntactic attention weight is higher than the average of the maximum attention weights, the head is regarded as important and as having a syntactic function. Different from the work in (Voita et al., 2019), which simply uses a fixed gate value to measure the importance of individual heads for all training examples and epochs, our method dynamically identifies individual heads for each sentence during training. We compare the syntactic attention weight Syn_attn with a dynamic and learnable syntactic gate Syn_gate, transformed from the head confidence h_conf in Eq. (3) by a sigmoid activation, i.e., Syn_gate = sigmoid(h_conf), to determine the head function. If Syn_attn is lower than Syn_gate, we treat the corresponding head as redundant.
h_label = 1 if Syn_attn > Syn_gate, and 0 otherwise,  (10)

where h_label indicates whether a head is important (h_label = 1) or redundant (h_label = 0).
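The identification rule of Eqs. (9)-(10) can be sketched in a few lines of NumPy. This is our own illustration (names are ours): `dep_mask01` is a 0/1 matrix marking syntax-related token pairs, `attn` is one head's n×n attention matrix:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def syntactic_attention_weight(attn, dep_mask01):
    """Eq. (9): average over queries of the total attention weight that
    falls on syntactically related tokens (dep_mask01[i, j] = 1 iff
    x_j is dependency-related to x_i)."""
    return (attn * dep_mask01).sum(axis=-1).mean()

def head_label(attn, dep_mask01):
    """Eq. (10): important (1) if Syn_attn exceeds the dynamic gate
    Syn_gate = sigmoid(h_conf), else redundant (0)."""
    h_conf = attn.max(axis=-1).mean()       # Eq. (3)
    syn_attn = syntactic_attention_weight(attn, dep_mask01)
    return 1 if syn_attn > sigmoid(h_conf) else 0
```

Because Syn_gate is derived from the head's own confidence rather than a global constant, the threshold adapts per head and per sentence as training progresses.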

Enlivening Redundant Heads
After distinguishing redundant heads from important ones in the multi-head self-attention, we further enliven the redundant heads with a syntactic bias per Eq. (4) or Eq. (6), without interfering with the functions of important heads. (Voita et al., 2019) show that redundant heads are mostly distributed in the lower encoder layers; meanwhile, (Hao et al., 2019) show that the bottom encoder layer, which directly takes word embeddings as input, benefits more from modeling local relations. We evaluate the performance of applying our method to the low- and high-level encoder layers in the next section, and obtain the best performance when applying it to the first encoder layer.

Experiments and Results

The byte-pair encoding (BPE) toolkit (Sennrich et al., 2016) is used with 32K merge operations. The 4-gram NIST BLEU score (Papineni et al., 2002) is used as the evaluation metric. We implement the proposed RHE and all baselines on top of the Transformer model (Vaswani et al., 2017) using the open-source toolkit OpenNMT (Klein et al., 2017). Please refer to the Appendix for more details on the datasets and parameter settings.

Table 1 shows the ablation study results of the Transformer enabled by the two proposed SEA mechanisms, DEA and LPEA, and the RHE approach. First, the rows "+DEA" and "+LPEA" represent models in which all heads of the first encoder layer, including the originally important heads, are replaced by the syntax-enhanced attention networks DEA and LPEA, respectively. Second, the RHE approach (the rows "+DEA+RHE" and "+LPEA+RHE") significantly lifts both the DEA and LPEA mechanisms across all small and large language pairs. This attests to the effectiveness of identifying and modifying redundant heads without interfering with the functions of important heads. RHE lifts LPEA such that LPEA+RHE substantially outperforms the Transformer by +1.0 BLEU points on En→De (WMT16), +0.96 BLEU points on En→De (WMT14), and +0.81 BLEU points on En→Cs (WMT16).
These results demonstrate the efficacy and applicability of both SEA and RHE designs.

RHE for NMT Results
The upper part of Table 1 shows the results of the Transformer enabled by two SAN enhancement strategies: the relative position encoding method (Rel_Pos) (Shaw et al., 2018), which considers the relative positions between sequence elements, and the localness modeling (Localness) method, which enhances the ability of self-attention to capture local context with a learnable Gaussian bias. While both Rel_Pos and Localness improve over the Transformer owing to their SAN enhancement strategies, our DEA-, DEA+RHE-, LPEA-, and LPEA+RHE-enabled Transformers substantially and consistently beat the standard Transformer as well as both the Rel_Pos- and Localness-enhanced Transformers. For example, our DEA+RHE on the Transformer outperforms Rel_Pos by over 0.49 BLEU points on En→De (WMT16), 0.29 BLEU points on En→De (WMT14), and 0.36 BLEU points on En→Cs (WMT16). This is owing to the SEA and RHE design of assigning a distinct syntactic bias to each word and modeling both short- and long-term syntactic relations.

RHE Mechanism Analysis
Here, we analyze the generalizability of RHE, the impact of different factors, and the visualization of the multi-head attention matrices. Owing to space limitations, we only report the test results on the En→De (WMT16) set, and explore the influence of syntax parsing quality and of the applied encoder layers in the Appendix. This shows the importance of precisely identifying redundant heads: only then does pruning the redundant heads have a trivial effect on learning performance, as shown in (Voita et al., 2019).

Selection of Multi-head Function Gate
Two strategies can be used to select the multi-head function gate: a fixed gate, set to a constant value throughout the whole training process, or a dynamic gate, transformed from the average of the maximum attention weights of an individual head, which provides a flexible criterion for determining the head function. Fig. 4 shows the comparison between multiple fixed gate values and the dynamic gate. We adjust the value of the fixed gate in the range (0.1, 0.5).
The results show that the dynamic gate strategy significantly outperforms all fixed gate values, and that performance becomes unstable as the fixed gate value increases. Self-attention heads develop their ability to capture syntactic relations over the training epochs; accordingly, the average syntactic attention weight Syn_attn increases gradually. A low fixed gate value reduces the recall of RHE because Syn_attn grows large in later epochs; a high fixed gate value reduces the accuracy of RHE because all heads, important and redundant alike, receive small Syn_attn in the initial epochs. Hence, a high fixed gate might mistakenly treat a large portion of heads as redundant.

Effect of Maximum Relative Local-Phrasal Position
Compared to the dependency tree, the constituency tree characterizes a distinct relative phrasal position for each word, which enriches the syntactic relations between elements. We thus evaluate the effect of varying the clipping distance k of the maximum absolute relative local-phrasal position. The results in Table 3 show that performance increases as k grows from 0 to 6, but this trend does not hold at k = 8. The average maximum phrase layer of the training set is 11.13, which is close to the maximum absolute relative phrasal positions k = 5 and k = 6 (where 2k + 1 is 11 and 13). This result indicates that the best performance appears when the relative phrasal position vectors exactly cover the average maximum phrase layer.

Visualization of LPEA+RHE-enlivened Attention
To evaluate the effect of the LPEA+RHE-enlivened redundant heads against Rel_Pos and Localness, we further visualize the attention matrices of an individual head in the first encoder layer. The source sentence is "Relations between Obama and Netanyhu have been strained for years EOS". The improvement from the redundant head to the LPEA+RHE-enlivened head is shown in Fig. 5 (a) and (b). In Fig. 5 (a), the attention distribution of the original redundant head concentrates on the end of the sentence (16th column) rather than on specific meaningful words. In Fig. 5 (b), SEA masks the words that do not belong to the local-phrase in each row and improves the attention within the local-phrase: 1) 'have been ... for years' in rows 8 and 9 is a long-distance and discontinuous phrase; 2) SEA strengthens the attention between 'Relations' and 'Obama', 'Netanyhu' in the 1st row, which carries the nmod dependency. Fig. 5 (c) and (d) show the results of Rel_Pos and Localness, both of which explicitly model locality for self-attention networks. Their attention weights are mainly distributed along the diagonal and over some short-range elements. Rel_Pos captures the phrase 'have been' in rows 8 and 9 but ignores the long-range phrase elements 'for years', since the influence of the relative position representation decays as the sequence distance increases. In Fig. 5 (d), the attention weight distribution of Localness is more flexible because it assigns a distinct Gaussian bias to each position, paying more attention to the local syntactic context. It captures the phrase 'between ... and' in the 6th row. However, the attention sometimes focuses on the word itself, such as the high attention weights of 'Relations' ('R el ations' in subword form) in the 1st column and 'strained' ('st ra in ed' in subword form) in the 11th column. In contrast, LPEA+RHE enlivens the redundant head by modeling the latent syntactic localness beyond the constraints of sequence distance. Fig. 5 (e) shows the attention matrix of an important head, which focuses on neighboring words. This result is consistent with the previous findings in (Voita et al., 2019).

Conclusions
While multi-head self-attention networks show significant potential in improving learning tasks such as NMT, an open and challenging topic is to quantify the redundancy and importance of each head and then improve the weak heads. This paper makes one step forward by not only precisely analyzing and identifying redundant heads but also introducing a dynamic redundant heads enlivening (RHE) mechanism that enlivens each redundant head toward its full potential without affecting the functions of the other, important heads, in contrast to enhancing all heads alike. The proposed dependency-enhanced attention and local-phrase-enhanced attention effectively capture the different syntactic relations between elements. We will work on strategies to integrate DEA and LPEA in the future.

A.1 Experimental Settings

We follow the Transformer (base model) setting in (Vaswani et al., 2017) to train the models and reproduce their reported results on the En→De task. The hidden size is 512, the filter size is 2,048, and the number of attention heads is 8. All models are trained on four NVIDIA TITAN Xp GPUs, each allocated a batch size of 4,096 tokens. We average the last 10 checkpoints to ensure the robustness of the translation performance.

A.2 Effect of Enhancing Different Layers in Encoder
The work in (Voita et al., 2019) shows that there is only one important head, associated with the rare-words function, on the first layer, while more heads with positional and syntactic functions lie on higher layers. Their work indicates the necessity of lifting individual heads rather than treating them all the same.
In this experiment, we test this by applying the local-phrase-enhanced attention to different combinations of encoder layers. As shown in Table A2, enhancing the syntactic function on the first layer outperforms applying it to any other layer combination and achieves the fastest training speed, since only one layer is modified; the performance drops as the enhanced layers move from bottom to top (Rows 2-5 in the table). Moreover, enhancing the syntactic function on the top three layers or on all layers (Rows 6 and 1) decreases translation performance. These results reveal that lower layers may have fewer important heads to be enhanced, while higher layers may have too many important heads, making differentiation during enhancement harder. In addition, our results are consistent with the analysis in related work (Hao et al., 2019), which shows that the lower encoder layers benefit more from modeling localness and phrase structure. Accordingly, we only enhance the first layer of the SAN in the remaining experiments.

A.3 Effect of Syntax Parsing Quality
We use an external constituency parser to generate the syntactic structure of the source sentence. Based on that, we extract the local-phrases and characterize the relative local-phrasal position features to modify the self-attention network. Hence, the impact of parser quality on translation performance needs to be analyzed.
We compare the effect of two classical constituency parsers, the PCFG-based parser (Petrov and Klein, 2007) and the neural parser (Kitaev and Klein, 2018), on the performance of the LPEA+RHE mechanism. Table A3 shows the reported parsing performance (F1 score) on the Penn Treebank WSJ test set (for English) and the corresponding translation BLEU scores in this work.
The results indicate that the higher the quality of the parse trees, the better the performance of the syntax-enhanced NMT model across dataset sizes and languages, with an improvement of about 0.30 BLEU points. We attribute the joint improvement in parsing and translation to the neural parser's use of a Transformer encoder to represent the sentence. Although exploring the best-performing parsing tools is not the focus of this work, we believe that, with higher-quality parsing tools, our SEA mechanisms have even more potential to represent the syntactic bias for the self-attention network.