Mixture of Attention Heads: Selecting Attention Heads Per Token

Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computing. However, the study of MoE components has mostly focused on the feed-forward layer in the Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism. MoA includes a set of attention heads, each with its own set of parameters. Given an input, a router dynamically selects a subset of k attention heads per token. This conditional computation schema allows MoA to achieve stronger performance than the standard multi-head attention layer. Furthermore, the sparsely gated MoA can easily scale up the number of attention heads and the number of parameters while preserving computational efficiency. Beyond performance improvements, MoA also automatically differentiates the heads' utilities, providing a new perspective for discussing the model's interpretability. We conducted experiments on several important tasks, including Machine Translation and Masked Language Modeling, with promising results against strong baselines that involve large and very deep models.


Introduction
In recent years, large models have become a popular trend in Natural Language Processing research, especially large-scale Transformers (Vaswani et al., 2017). Model capacity has increased from millions of parameters (Devlin et al., 2019; Liu et al., 2019) to billions (Shoeybi et al., 2019; Raffel et al., 2020; Wang et al., 2022) and even trillions (Du et al., 2021; Fedus et al., 2021). However, these large-scale models demand substantially more computation than small-scale models.
A popular trend is to utilize conditional computation with sparsely activated models to seek greater computational efficiency: only part of the model's parameters are used for a given input during the forward computation, which alleviates the computational load.
Among these attempts, the Mixture of Experts (MoE) (Jacobs et al., 1991; Jordan and Jacobs, 1994) is an essential technique. Since the first application of the mixture of experts to the Transformer architecture (Shazeer et al., 2018), researchers have mainly focused on combining the Feed-Forward Network layer with the Mixture of Experts. Recent works have discussed how to obtain a better routing strategy (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) or how to scale up the Mixture of Experts across GPU nodes (Lepikhin et al., 2021; Fedus et al., 2021). However, few attempts have explored combining MoE with the Multi-Head Attention (MHA) mechanism. Since MHA is the other compulsory module in the Transformer architecture, combining MoE with the attention mechanism could also help achieve better performance while restraining the computational cost.
Besides, previous research has investigated the utility of different attention heads. Peng et al. (2020) found that the combination (reallocation) of a subset of attention heads helps translation, since useless attention heads are pruned. In the field of dependency parsing, researchers have unveiled that some attention heads in BERT-like language models (Devlin et al., 2019; Liu et al., 2019) model individual dependency types (Htut et al., 2019) and syntactic functions (Shen et al., 2022). Voita et al. (2019) claimed that attention heads have different functions that can be categorized into three types. There is no need to pass an input token through all attention heads if we can select the relevant heads whose functions are appropriate. Thus, we conceive an attention mechanism that selects different attention heads per token.
Based on the above discussion, we propose the Mixture of Attention Heads (MoA) (Section 4), an attention mechanism that selects different attention heads for different inputs. A simple illustration of this idea is shown in Figure 1. MoA includes a set of attention heads with different parameters. Given an input, a routing network dynamically selects a subset of k attention heads for each token. The output is a weighted sum of the selected attention heads, weighted by the confidence calculated by the routing network.
We conducted experiments on two tasks: Machine Translation and Masked Language Modeling (Section 5). Experiments show promising results against several strong baselines. On all tasks, our proposed mixture of attention heads outperforms the original Transformer architecture (Vaswani et al., 2017). Our model surpasses many large models or achieves comparable results with only half the computational cost. Our contributions are threefold: 1) we propose a new attention mechanism called Mixture of Attention Heads, combining the idea of Mixture of Experts with the attention mechanism; 2) MoA improves the model's performance without substantially adding parameters or computational cost; 3) MoA is easy to scale up while maintaining restrained computational complexity, resulting in further performance gains.

Related Work
Mixture of Experts The Mixture of Experts (MoE) was first introduced in the 1990s (Jacobs et al., 1991; Jordan and Jacobs, 1994). Shazeer et al. (2017) adopted this method in modern deep learning architectures (LSTM; Hochreiter and Schmidhuber 1997) and proved its effectiveness in Language Modeling and Machine Translation. MoE was used to substitute the FFN layers of the Transformer architecture (Vaswani et al., 2017) in the Mesh TensorFlow library (Shazeer et al., 2018). GShard (Lepikhin et al., 2021) is a lightweight module that helps scale a multilingual neural machine translation Transformer with a Sparsely-Gated Mixture of Experts beyond 600 billion parameters. In Switch Transformer (Fedus et al., 2021), the authors scaled the MoE-integrated Transformer architecture toward trillion-parameter models. GLaM (Du et al., 2021) utilized a decoder-only architecture for language model pre-training. Rajbhandari et al. (2022) proposed a Pyramid-Residual-MoE for smaller model size and fast inference.
Various routing strategies (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) have been investigated to stabilize MoE training and balance the expert loads. Chi et al. (2022) pointed out the representation collapse issue in sparse Mixture of Experts models and solved it with a two-stage routing strategy.
Machine Translation Architectures With the original Transformer architecture (Vaswani et al., 2017), Ott et al. (2018) found that training with reduced precision and large batches improves translation performance. Some models obtain better translation performance by using larger-scale Transformers. Liu et al. (2020a) deepened the encoder and decoder of the Transformer by adequately initializing the model. DeepNet (Wang et al., 2022) scaled Transformers up to 1,000 layers by introducing a new normalization function. However, these methods require a great amount of computation. Other models change the self-attention module. Peng et al. (2020) proposed the MAE model: reallocating attention heads improves translation, since the model prunes useless attention heads. However, their method is difficult to scale up for further improvements, because it uses all the attention heads in the model rather than sparsely activating them; it also requires complicated block coordinate descent training steps. Wu et al. (2019) proposed DynamicConv and LightConv, replacing the self-attention mechanism with a lightweight convolution.
Specialization of Attention Heads Since the publication of the Transformer architecture (Vaswani et al., 2017), many researchers have been interested in analyzing how the attention mechanism works. Voita et al. (2019) systematically analyzed the attention heads in the encoder and categorized them into three functional subsets: positional, syntactic, and rare words. In dependency parsing, researchers have observed the same phenomenon: different heads capture different syntactic functions (Htut et al., 2019; Shen et al., 2022).

Mixture of Experts
MoE (Shazeer et al., 2017) contains a set of expert networks E_1, E_2, ..., E_N and a routing network G. The output of the MoE is the weighted sum of the outputs of the experts, where the routing network calculates the probability of each expert:

y = \sum_{i=1}^{N} G(x)_i \, E_i(x). (1)

The routing network G is a Noisy Top-k Routing network. Before the softmax function, Gaussian noise is added to the gating logits (Equation 3); then only the top k values are kept, setting the rest of the gate values to 0 (Equation 2):

G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k)), (2)

H(x)_i = (xW_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((xW_{noise})_i), (3)

where KeepTopK(·, k) keeps the top-k entries and sets the rest to −∞, so that their gate values equal 0 after the softmax.
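As a concrete sketch, the noisy top-k gating above can be written in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the function names `noisy_topk_gate` and `moe_forward` are ours, and the Softplus-scaled Gaussian noise follows Shazeer et al. (2017)'s formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def noisy_topk_gate(x, W_g, W_noise, k, rng):
    # H(x)_i = (x W_g)_i + StandardNormal() * Softplus((x W_noise)_i)  (Eq. 3)
    clean = x @ W_g
    noise = rng.standard_normal(clean.shape) * np.log1p(np.exp(x @ W_noise))
    logits = clean + noise
    # Keep only the top-k logits; the rest are masked to -inf so their
    # gate values become exactly 0 after the softmax (Eq. 2).
    masked = np.full_like(logits, -np.inf)
    top = np.argsort(logits)[-k:]
    masked[top] = logits[top]
    return softmax(masked)

def moe_forward(x, experts, gate):
    # y = sum_i G(x)_i * E_i(x)  (Eq. 1): only selected experts run.
    return sum(g * E(x) for g, E in zip(gate, experts) if g > 0.0)
```

Because the gate is exactly zero outside the top-k set, the unselected experts are never evaluated, which is the source of the conditional-computation savings.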
Multi-Head Attention

Vaswani et al. (2017) proposed the encoder-decoder Transformer architecture, which contains the multi-head attention module. Different heads of the multi-head attention module attend to information from different representation subspaces, learning the input from various perspectives. Performing multi-head attention with k heads, Q, K, and V are linearly projected k times with different, learned linear projections into subspaces. On each projected Q and K, the attention scores are calculated via Equation 4. Values deriving from different heads are projected back to the model dimension and summed up, as in Equation 5:

\mathrm{Att}_i(Q, K, V) = \mathrm{Softmax}\left(\frac{QW_i^q (KW_i^k)^\top}{\sqrt{d_h}}\right) VW_i^v, (4)

\mathrm{MHA}(Q, K, V) = \sum_{i=1}^{k} \mathrm{Att}_i(Q, K, V)\, W_i^o, (5)

where W_i^q, W_i^k, W_i^v ∈ R^{d_m × d_h} and W_i^o ∈ R^{d_h × d_m}.

Mixture of Attention Heads
In this work, we propose a variant of multi-head attention for the Transformer called Mixture of Attention Heads (MoA), illustrated in Figure 2. MoA consists of two major components: the routing network G and a group of N attention experts {E_1, ..., E_N}. As in standard multi-head self-attention, the input of MoA includes three sequences: the query sequence Q, the key sequence K, and the value sequence V. We denote by q_t the query vector at time step t. For each q_t, the routing network G selects a subset of k experts G(q_t) ⊆ {E_i} based on q_t and assigns a weight w_i to each selected expert. The selected experts take q_t, K, and V as inputs and compute outputs E_i(q_t, K, V). The output of the MoA is the weighted sum of the selected experts' outputs. Formally, the MoA output at time step t can be written as:

y_t = \sum_{E_i \in G(q_t)} w_i \, E_i(q_t, K, V).

Routing Network
Similar to previous mixture-of-experts methods, the routing network assigns attention experts to each input query. To select k experts for query q_t, we compute a routing probability p_i for each expert E_i. The routing probability is modeled with a linear layer W_g and a softmax function:

p = \mathrm{Softmax}(q_t W_g),

where W_g ∈ R^{d_m × N} is the routing matrix. Based on the routing probability p, we select the top-k attention experts among all N attention experts with the largest probabilities, G(q_t) = \{E_i \mid p_i \in \mathrm{TopK}(p, k)\}. Then, we renormalize the routing probabilities of the selected experts to get normalized expert weights:

w_i = \frac{p_i}{\mathrm{Detach}\left(\sum_{E_j \in G(q_t)} p_j\right)},

where Detach(·) is a function that stops gradient backpropagation. In other words, the denominator receives zero gradient during training. We empirically find that this trick helps the routing network learn better routing probabilities.
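The routing step can be sketched as follows. This is a hypothetical NumPy rendering (the paper's implementation would live in a deep learning framework, where Detach(·) corresponds to `tensor.detach()`; here a stand-in `detach` function only marks where gradients would be stopped):

```python
import numpy as np

def detach(x):
    # Stand-in for tensor.detach(): in an autograd framework this stops
    # gradients flowing through the normalization denominator.
    return np.asarray(x)

def moa_route(q_t, W_g, k):
    """Select the top-k attention experts for one query vector q_t.

    q_t: (d_m,) query; W_g: (d_m, N) routing matrix.
    Returns (expert indices, renormalized weights)."""
    logits = q_t @ W_g
    e = np.exp(logits - logits.max())
    p = e / e.sum()                              # softmax routing probability
    selected = np.argsort(p)[-k:]                # top-k experts by probability
    w = p[selected] / detach(p[selected].sum())  # renormalized weights
    return selected, w
```

The renormalized weights always sum to one over the selected experts; the detach trick only changes which path the gradient takes, not the forward value.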

Attention Expert
An attention expert contains four projection matrices: W^q, W^k, W^v, and W^o. The attention calculation is similar to multi-head attention. We first compute the attention weights over the keys:

a_{i,t} = \mathrm{Softmax}\left(\frac{q_t W_i^q (K W^k)^\top}{\sqrt{d_h}}\right),

where W_i^q ∈ R^{d_m × d_h} is the query projection matrix, W^k ∈ R^{d_m × d_h} is the key projection matrix, d_m is the hidden state size, and d_h is the head dimension. We then compute the weighted sum of values:

o_{i,t} = a_{i,t} (V W^v),

where W^v ∈ R^{d_m × d_h} is the value projection matrix. Finally, the attention output is obtained by projecting o_{i,t} back to the hidden state space:

E_i(q_t, K, V) = o_{i,t} W_i^o,

where W_i^o ∈ R^{d_h × d_m} is the output projection matrix. In multi-head attention, the projection matrices W^q, W^k, W^v, and W^o all differ across attention heads. MoA shares W^k and W^v across attention experts to reduce the computational complexity; attention experts are differentiated only by W_i^q and W_i^o. Thus, the expensive matrix projections of the key sequence KW^k and the value sequence VW^v can be pre-computed once and shared by all attention experts. Each expert only needs to compute the vector projections q_t W_i^q and o_{i,t} W_i^o. This design significantly reduces the computational and space complexity when the number of experts is large.
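Putting the routing network and the attention experts together, a full MoA forward pass might look like the following NumPy sketch. All names and shapes are illustrative (not the authors' code), and the per-token loop is written for clarity rather than efficiency; note how K and V are projected once and reused by every expert.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def moa_forward(Q, K, V, W_q, W_o, W_k, W_v, W_g, k):
    """One MoA layer forward pass (illustrative shapes).

    Q: (T, d_m) queries; K, V: (S, d_m) keys/values.
    W_q: (N, d_m, d_h), W_o: (N, d_h, d_m) per-expert matrices.
    W_k, W_v: (d_m, d_h) shared across all experts.
    W_g: (d_m, N) routing matrix; k experts are selected per token.
    """
    N, d_m, d_h = W_q.shape
    Kp = K @ W_k                 # (S, d_h): pre-computed once,
    Vp = V @ W_v                 # shared by every attention expert
    P = softmax(Q @ W_g)         # (T, N) routing probabilities
    out = np.zeros_like(Q)
    for t in range(Q.shape[0]):
        sel = np.argsort(P[t])[-k:]            # top-k experts for token t
        w = P[t, sel] / P[t, sel].sum()        # renormalized expert weights
        for wi, i in zip(w, sel):
            # Per-expert work is only two vector projections (W_q[i], W_o[i]).
            a = softmax(Q[t] @ W_q[i] @ Kp.T / np.sqrt(d_h))  # (S,) weights
            out[t] += wi * (a @ Vp @ W_o[i])   # back to (d_m,)
    return out
```

The shared `Kp`/`Vp` projections are exactly the design choice the text describes: the key/value cost is paid once per layer, while each selected expert adds only query- and output-projection work.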

Training Losses
Previous work (Shazeer et al., 2017) has observed that the routing network tends to converge to a state where it always produces large weights for the same few experts, which indicates insufficient utilization of the experts. Following Shazeer et al. (2017) and Fedus et al. (2021), we add an auxiliary loss to balance the loads of the different experts.
Given N experts and a sequence of T queries Q = {q_1, q_2, ..., q_T}, the auxiliary loss L_a is computed as:

L_a = N \sum_{i=1}^{N} f_i \, P_i,

where f_i = \frac{1}{T}\sum_{t=1}^{T} \delta(\arg\max_j p_{j,t} = i) is the fraction of tokens attributed to the i-th expert (δ is the Kronecker symbol), and P_i = \frac{1}{T}\sum_{t=1}^{T} p_{i,t} is the mean router probability allocated to the i-th expert; both are normalized to sum to one over the expert dimension. Mathematically, f_i is non-differentiable while P_i is differentiable. Thus, a larger f_i results in a larger derivative, which penalizes P_i by making larger P_i smaller; moreover, since P is calculated by a softmax, smaller P_i become bigger in turn. Zoph et al. (2022) introduced a router z-loss (Equation 16) to penalize large logits entering the gating network, which stabilizes training and improves performance:

L_z = \frac{1}{T}\sum_{t=1}^{T}\left(\log \sum_{i=1}^{N} e^{x_{i,t}}\right)^2, (16)
where x_{i,t} is the pre-softmax logit computed by the router for the i-th expert and input query q_t. Each mixture-of-attention-heads module has an auxiliary loss and a router z-loss. We sum them over modules and add them, with multiplicative coefficients α and β respectively, to the total model loss during training. Throughout this work, we use α = 0.01 and β = 0.001, large enough for the two added losses to be effective but small enough not to disturb the primary cross-entropy model loss.
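Given these definitions, both training losses can be computed from the (T × N) matrix of router logits. The sketch below is an illustrative NumPy version; the function name `moa_aux_losses` is ours, not from the paper.

```python
import numpy as np

def moa_aux_losses(logits):
    """Load-balancing loss L_a and router z-loss L_z.

    logits: (T, N) pre-softmax router logits x_{i,t} for T queries, N experts."""
    T, N = logits.shape
    shifted = logits - logits.max(-1, keepdims=True)
    p = np.exp(shifted)
    p = p / p.sum(-1, keepdims=True)                 # router probabilities
    f = np.bincount(p.argmax(-1), minlength=N) / T   # token fraction per expert
    P = p.mean(0)                                    # mean probability per expert
    L_a = N * np.dot(f, P)                           # equals 1 under uniform load
    # z-loss: squared log-partition of the logits, computed stably.
    L_z = np.mean((logits.max(-1) + np.log(np.exp(shifted).sum(-1))) ** 2)
    return L_a, L_z
```

In practice these would be accumulated per MoA module and added to the task loss with coefficients α and β as described above.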
To validate the utility of these auxiliary losses, we conducted ablation tests and the results are shown in Appendix D.

Computational Complexity and Number of Parameters
On the one hand, given a sequence of T tokens, the amount of computation required by an MoA layer that selects the top-k experts is

C_{MoA} = 2T(kd_h)d_m + 2Td_hd_m + 2T^2(kd_h),

counting the per-expert query/output projections, the shared key/value projections, and the attention computation. Here kd_h is the sum of the head dimensions of the selected experts; it represents the maximum amount of information that can be collected by an MoA layer for one token. On the other hand, the amount of computation required by a standard Multi-Head Attention (MHA) layer is

C_{MHA} = 4Td_m^2 + 2T^2d_m,

where d_m is the sum of the head dimensions. If kd_h ≃ d_m, the computational complexity of MoA is smaller than that of MHA, since the shared key/value projections cost 2Td_hd_m < 2Td_m^2. In other words, MoA can collect more information for each token while maintaining a similar level of computational complexity as MHA.
As for the number of parameters, given a Mixture of Attention Heads with E attention experts, the numbers of parameters in MoA and MHA are

P_{MoA} = (2E + 2)d_hd_m, \qquad P_{MHA} = 4d_m^2.

When Ed_h ≃ d_m and E > 1, the number of parameters in MoA is smaller than in MHA. In other words, MoA can collect more information for each token while maintaining a similar number of parameters as MHA. More details of the calculation are in Appendix B.
The above discussion suggests that, from an information-collection point of view, MoA is more computation- and parameter-efficient than the standard MHA. Our experimental results in Section 5 also empirically support this hypothesis. Additionally, the time complexity of MoA is determined by the number of selected attention heads k and the attention head dimension d_h, not by the model's total parameters. One could arbitrarily increase the number of parameters in MoA without increasing its computational complexity.
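The counting argument can be checked numerically. The sizes below are illustrative, chosen so that Ed_h = kd_h = d_m; they are not configurations reported in the paper.

```python
# Back-of-envelope check of the MoA vs. MHA counting argument,
# with illustrative (not paper-reported) sizes where E*d_h = k*d_h = d_m.
T = 10       # sequence length
d_m = 512    # model (hidden) dimension
d_h = 128    # attention expert head dimension
E = 4        # total attention experts
k = 4        # experts selected per token

# Per-layer MACs: per-expert q/o projections + shared k/v projections + attention
macs_moa = 2 * T * k * d_h * d_m + 2 * T * d_h * d_m + 2 * T**2 * k * d_h
macs_mha = 4 * T * d_m**2 + 2 * T**2 * d_m

# Parameter counts
params_moa = (2 * E + 2) * d_h * d_m
params_mha = 4 * d_m**2

assert macs_moa < macs_mha      # fewer MACs at equal collected information
assert params_moa < params_mha  # fewer parameters, since E*d_h = d_m and E > 1
```

With these numbers, the MoA layer needs about 6.66M MACs versus 10.59M for MHA, and 0.66M parameters versus 1.05M, while collecting the same kd_h = d_m dimensions of information per token.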

Machine Translation
Dataset We train our Mixture of Attention model on the WMT 2014 English-German and English-French datasets (Bojar et al., 2014). Following the experimental settings used in Liu et al. (2020b), all sentences were encoded with byte-pair encoding (Sennrich et al., 2016). For both tasks, we use a joined dictionary and share all word embeddings between the encoder and the decoder. The shared vocabulary size is 32k for English-German and 40k for English-French. For evaluation, we average the checkpoints of the last 10 epochs. We report BLEU scores (Papineni et al., 2002) computed with MULTI-BLEU.PERL and apply the compound-split post-processing [2] introduced in Vaswani et al. (2017). We use MACs (Multiply-Accumulate Operations) [3] to evaluate the computational complexity of different models on a fixed input. Details of the MACs calculation are in Appendix E.
Baselines We compare with several strong baselines: Transformer base and big (Vaswani et al., 2017); Transformer big (Ott et al., 2018) trained with reduced precision and large batches; DynamicConv (Wu et al., 2019), which replaces the self-attention mechanism with a lightweight convolution; MAE-7 (Peng et al., 2020), which reallocates attention heads; and Admin (Liu et al., 2020a), which deepens the Transformer architecture.
For our model, three parameters are used to differentiate its variants: the number of activated attention heads per token (K), the total number of experts (E), and the attention expert dimension (D). For example, our MoA base model is noted 8K8E128D because it has 8 attention experts of dimension 128, and all 8 experts are activated for each token. Our MoA big model is 16K32E256D, as it has 32 attention experts and sparsely activates the top 16 experts for each token.

[2] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh
[3] We adopt the open-source tool PTFLOPS (https://github.com/sovrasov/flops-counter.pytorch) to calculate the MACs.
[4] These MACs values are underestimated, because PTFLOPS does not support the customized convolution layers in DynamicConv and LightConv.

Results
The results on the test sets of the WMT14 EnDe and WMT14 EnFr datasets are shown in Table 1. The table is split into two parts: the upper part for base models and the lower part for large models. On all datasets, MoA base outperforms Transformer base and Admin 6L-6L by at least 0.6 BLEU. On the WMT14 EnFr dataset, MoA base also outperforms Transformer big. On the WMT14 EnDe dataset, MoA base reaches results comparable to the Mixture of Attention Experts model (MAE-7), the state-of-the-art among base-level models. The MACs of MAE-7 and our model are comparable in the 8-attention-head setting. While both models leverage the idea of weighting the attention heads, MoA is easier to implement and does not require the complicated block coordinate descent training steps. Compared to standard multi-head self-attention, the routing mechanism pays more attention to the more informative attention heads for each token, enabling the MoA base model to achieve better computation and parameter efficiency.
In the big-scale setting, MoA big consistently outperforms standard Transformer big models despite requiring significantly less computation. Compared to models with more parameters, MoA remains very competitive. Only Admin 60L-12L outperforms MoA big on both datasets; however, that model has more parameters and requires about twice the MACs. The MACs of MoA big is 1220M, the lowest among big-scale models. This result shows that our proposed method can easily scale up to a large number of parameters and achieve good results without substantially burdening the computation system.

Masked Language Modeling
Masked Language Modeling is the standard training objective for many Pretrained Language Models (PLMs), including BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). The task replaces a random sample of tokens in the input sequence with a special token [MASK]; the training objective is a cross-entropy loss on predicting the masked tokens. To better mimic the procedure of training PLMs, we adopt the setting introduced in RoBERTa (Liu et al., 2019) for the masked language modeling experiment.
Dataset We conducted masked language modeling on the WikiText-103 dataset (Merity et al., 2016). The corpus includes over 100 million tokens collected from verified Good and Featured articles on English Wikipedia. Following the settings in Merity et al. (2016), the training/validation/test sets have 103M/218K/246K words. The corpus is tokenized with the 50K subword vocabulary used in RoBERTa and initially introduced in GPT (Radford et al., 2019).
Settings We train the model with the dynamic masking strategy and the full-sentences input format. To avoid overfitting on the training corpus, we adopt a medium-size RoBERTa model as the base model, with 512-dim word embeddings, a 2048-dim feed-forward network, 8 heads, and 8 layers. Training details can be found in Appendix C. Perplexity is used as the evaluation metric.

Results
Table 2 shows the perplexity on the WikiText-103 test data. While using a similar number of parameters, MoA outperforms the standard Transformer model by 0.13 perplexity. Furthermore, performance improves as the number of experts E and head size D increase, while the number of selected heads K and the computational complexity remain the same. This observation shows that our model can improve performance while maintaining the computational complexity.

Model Analysis
MoA parameter influence We study the influence of the three parameters K, E, and D on the WMT14 En-De dataset. The results are shown in Table 3. For the expert dimension D, we fix K = 8 and E = 32 and vary D over 64, 128, and 256. As the expert dimension D increases (rows C, D, E in Table 3), the PPL on the validation set and the BLEU score on the test set both improve. This amelioration is due to the increase in parameters: with a larger expert dimension, each expert has more parameters, and the computational cost increases. We believe the increase in computational cost is acceptable. As shown in Table 1, the Transformer big model has 2090M MACs and reaches a BLEU of 28.4, whereas by enlarging the expert hidden size we reach a BLEU of 28.8 with only 841M MACs (≪ 2090M).
For the number of attention experts E, we fix K = 8 and D = 256 and select three values of E: 8, 16, and 32. As the number of experts grows, the PPL on the validation set goes down, indicating our model's ability to scale up continuously. The BLEU score on the test set does not track the PPL, possibly because the training objective is not directly linked to the BLEU score calculation. However, we still observe that 32 experts achieve better BLEU than the other two settings. As the number of selected attention heads K remains unchanged, the MACs of these three settings are the same. Thus, MoA allows us to improve the model by adding more parameters without changing the computational complexity.
For the number of selected attention heads K, we test three values of K (4, 8, and 16), freezing E = 32 and D = 256. As the number of selected attention heads increases, we observe that the PPL on the validation set decreases and the BLEU score on the test set goes up.
Since the number of attention experts remains the same, the model's total parameters stay at 200M.This result shows the trade-off between computation efficiency and performance.The model needs more computations for better performance as the MACs vary from 654M to 1220M.
MoA Expert loads Load balancing is a long-standing problem of MoE models (Fedus et al., 2021). Figure 3 shows the experts' load percentages for encoder layer 4. To examine what each expert specializes in, we compute the pointwise mutual information (PMI) between tokens and experts:

\mathrm{PMI}(\text{token}_i, \text{expert}_j) = \log \frac{p(\text{token}_i, \text{expert}_j)}{p(\text{token}_i)\, p(\text{expert}_j)}.
For each expert, the larger the PMI, the more relevant the token is to that expert.
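Given a token-to-expert co-occurrence tally, the PMI statistic can be computed as below. This is an illustrative NumPy sketch (the counts here are a toy tally; the real statistic would be gathered from routing decisions over a corpus), using the log form so that 0 indicates independence.

```python
import numpy as np

def pmi(counts):
    """Pointwise mutual information between tokens and experts.

    counts[i, j]: number of times token i was routed to expert j.
    Returns log p(token, expert) / (p(token) p(expert)); entries for
    never-co-occurring pairs are -inf.
    """
    joint = counts / counts.sum()            # p(token, expert)
    p_tok = joint.sum(1, keepdims=True)      # marginal p(token)
    p_exp = joint.sum(0, keepdims=True)      # marginal p(expert)
    with np.errstate(divide="ignore"):
        return np.log(joint / (p_tok * p_exp))
```

A large positive entry means a token is routed to an expert far more often than chance, which is how the indicative tokens per expert can be ranked.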

Limitations
In this work, we scale MoA up to at most 64 experts. However, works combining mixture of experts with the FFN layer can expand the number of experts to thousands. In the future, we will explore the limits of MoA's scaling ability.
Our implementation of MoA is not fully optimized: our code does not fully exploit the parallel computing capability of GPUs, and the current implementation spends extra time on memory copy operations. Although the computational complexity (MACs) of MoA is relatively low compared to other baselines, the running time of our implementation is not optimal. In the future, if we optimize the implementation at the CUDA kernel level to remove the memory copy ops, we expect to at least halve the wall-clock time, which would make an MoA block as fast as a standard attention block.
Similar to the Transformer architecture, MoA needs a careful hyperparameter search to reach satisfying results.
A Experts' load percentages

B Computational Complexity Proof

Let q = \frac{E d_h}{d_m}. The ratio between the parameters of a MoA layer and those of a multi-head attention layer is

\frac{(2E + 2) d_h d_m}{4 d_m^2} = \frac{q}{2}\left(1 + \frac{1}{E}\right).

When Ed_h ≃ d_m, we have q ≃ 1, and the ratio becomes \frac{1}{2}(1 + \frac{1}{E}), a hyperbolic-like curve whose value equals 1 when E = 1. Therefore, if E > 1, the proportion between the parameters of MoA and those of multi-head attention is below 1. Thus, the MoA layer contains fewer parameters than a multi-head attention layer.

C Training Details

All of our models are trained on 32 V100 GPUs. We use the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.98, and ϵ = 10^{-9}. We use an inverse square root learning rate scheduler for the translation tasks and a linear scheduler for the masked language modeling task. During training, we employ label smoothing (Szegedy et al., 2016) with value 0.1. More training hyperparameters can be found in Table 6.

Figure 1 :
Figure 1: Simple illustration of MoA. MoA consists of a set of attention heads named attention experts. For each token in the input, a Router selects k attention heads among all attention experts with different confidences. The output is a weighted sum of the selected attention heads, weighted by the confidence calculated by the Router.

Figure 3 :
Figure 3: Experts' load percentages for encoder layer 4. Experts are indexed by their order of percentages.

Figure 4 :
Figure 4: Experts' load percentages for different encoder layers.

We compare the distribution of tokens attributed to different experts across the encoder layers of 16K32E512D. The results are shown in Figure 4. The load percentages of the experts in different layers are relatively balanced.
Given a Mixture of Attention Heads with E attention experts, a MoA layer has (2E + 2)d_h d_m parameters, while a multi-head attention layer has 4d_m^2 parameters. To compare the two, we take their ratio.

Table 1 :
BLEU scores on the WMT14 translation datasets. MACs (Multiply-Accumulate Operations) [3] measure the computational complexity of each model. For each model, MACs are computed on a source sentence of length T_src = 10 and a target sentence of length T_tgt = 10.

Table 2 :
Perplexity on the WikiText-103 test data for masked language modeling. MACs are computed on an input sequence of length T = 128.

Table 3 :
BLEU scores of different MoA models on the WMT14 EnDe dataset.

Table 4 :
Indicative tokens of each expert for the first encoder layer of MoA.