Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention

Scaling pre-trained language models has resulted in large performance gains on various natural language processing tasks, but comes at a large cost in memory requirements. Inspired by the position embeddings in transformers, we aim to simplify and reduce the memory footprint of the multi-head attention (MHA) mechanism. We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE), i.e. one per head. We empirically demonstrate that our MHE attention is substantially more memory efficient than alternative attention mechanisms, while achieving a high predictive performance retention ratio relative to vanilla MHA on several downstream tasks. MHE attention requires only a negligible fraction of additional parameters ($3nd$, where $n$ is the number of attention heads and $d$ the size of the head embeddings) compared to single-head attention, while MHA requires $(3n^2-3n)d^2$ additional parameters.


Introduction
Scaling pre-trained language models (PLMs) aims to enhance performance by increasing their size and capacity, leading to models with an unprecedented number of parameters (Kaplan et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022). Simply increasing the size of PLMs and the pre-training data has yielded state-of-the-art performance on various natural language processing (NLP) tasks (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; Raffel et al., 2020; Brown et al., 2020; Clark et al., 2022a; Ouyang et al., 2022; Touvron et al., 2023).
However, the pursuit of ever larger PLMs comes with large computational requirements. This has direct environmental implications such as large carbon emissions (Lacoste et al., 2019; Strubell et al., 2019; Weidinger et al., 2022), conflicting with the principles of Green artificial intelligence development (Schwartz et al., 2020). Moreover, scaling can hinder researchers with limited access to computing resources from participating in advancing the field (Schwartz et al., 2020). This results in inequalities, where only a privileged few can actively contribute, potentially impeding diversity and inclusivity (Weidinger et al., 2022).
The backbone of transformers (Vaswani et al., 2017) is the multi-head attention (MHA) module, which extends the standard single-head attention (SHA) proposed by Cho et al. (2014). MHA applies an attention mechanism (i.e. a head) multiple times to the same set of queries, keys and values, using a different set of parameters (i.e. projection matrices) for each head. This results in MHA modules with a large memory footprint that increases with the number of layers and attention heads per layer in PLMs (Devlin et al., 2019; Brown et al., 2020; Ouyang et al., 2022; Touvron et al., 2023). Figure 1 shows how the number of parameters of a single attention sublayer increases with its number of attention heads.
Previous work has attempted to address this issue by sharing projection matrices or eliminating them entirely to improve the parameter efficiency of MHA. Lan et al. (2020a) proposed sharing projection parameters for keys, queries and values across layers, while Kitaev et al. (2020) introduced a method for sharing the projection matrix between keys and values within each transformer layer. Similarly, multi-query attention uses a single pair of global projection matrices for keys and values in each layer (Shazeer, 2019; Chowdhery et al., 2022; Ainslie et al., 2023). Furthermore, Yan et al. (2021) eliminate the projection matrices entirely and directly treat the input hidden states as both keys and values. In a different direction, Lee-Thorp et al. (2022) propose models that replace the attention blocks with token-mixing blocks (i.e. using linear or Fourier transformations) that contain fewer or no parameters compared to MHA.
Inspired by the position embeddings in transformers (Vaswani et al., 2017; Devlin et al., 2019), we aim to simplify and reduce the memory footprint of the MHA mechanism. We achieve this using a single projection matrix for each of the queries, keys and values, shared across all attention heads, together with one embedding per head (MHE).
Our contributions are as follows:
• We propose MHE, a novel attention module that uses projection matrices shared across heads, modified by corresponding head embeddings. Our method generates multiple attention heads while requiring only a small fraction of additional parameters compared to single-head attention.
• We empirically demonstrate that our MHE attention is substantially more parameter efficient than alternative attention mechanisms, while achieving a high predictive performance retention ratio (i.e. 92.9–98.7%) relative to MHA on several downstream tasks. MHE has $(3n^2-3n)d^2 - 3nd$ fewer parameters than MHA for a single attention sublayer with $n$ attention heads and head dimension $d$.
Related Work

Model Compression
To make PLMs memory efficient, previous work has focused on the following post-hoc model compression approaches (Ganesh et al., 2021; Tay et al., 2022).
Quantization Hubara et al. (2017) proposed representing weights using fewer bits to reduce memory requirements. Zadeh et al. (2020) introduced a method for identifying outliers in weights and excluding them during quantization. Another direction involves additional training steps to adjust the quantized weights, i.e. quantization-aware training (Zafrir et al., 2019; Boo and Sung, 2020; Stock et al., 2020; Shen et al., 2020; Tambe et al., 2021; Tao et al., 2022). Bai et al. (2022) developed a more efficient post-training quantization approach that minimizes the reconstruction error incurred by quantization.
Pruning These compression approaches entirely remove parts of the network, such as weights close to zero (Gordon et al., 2020; Mao et al., 2020; Chen et al., 2020) and weights that move towards zero during fine-tuning (Sanh et al., 2020; Tambe et al., 2021). Beyond operating on individual weights, previous work has attempted to remove structured blocks of weights or even architectural components such as attention heads and encoder layers (Fan et al., 2019; Prasanna et al., 2020; Khetan and Karnin, 2020; Li et al., 2020a; Lin et al., 2020; Tay et al., 2021).
Knowledge Distillation This set of techniques typically trains a lightweight student model to mimic the outputs of a larger teacher PLM (Sun et al., 2019; Li et al., 2020b; Jiao et al., 2020; Sun et al., 2020; Li et al., 2021; Tahaei et al., 2022). In a similar direction, smaller PLMs have recently been fine-tuned on text generated by larger PLMs (Chiang et al., 2023; Taori et al., 2023).
Weight Matrix Decomposition Previous work has also proposed replacing large weight matrices with the product of two smaller ones to reduce model size and runtime memory. Weight matrix decomposition has been applied to linear layers (Mao et al., 2020; Ben Noach and Goldberg, 2020), the embedding matrix (Lan et al., 2020b; Tambe et al., 2021; Wang et al., 2022), and attention blocks (Hu et al., 2022; Wang et al., 2022).
Embedding Matrix Compression Finally, various methods have been proposed for compressing the embedding matrix during pre-training and fine-tuning (Xue et al., 2022; Clark et al., 2022b; Xue and Aletras, 2022).

Improving Attention Efficiency
Previous work on making attention more efficient includes efforts towards (1) speeding up pairwise computations between token representations; and (2) parameter efficiency.
Computational Efficiency While improving the computational efficiency of attention is out of the scope of our paper, we provide a brief overview of previous work since it is complementary to parameter efficiency. One approach to speeding up the attention computation is to reduce the number of similarity computations between representations in different positions using predefined local windows and fixed or dynamic strides (Child et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020; Kitaev et al., 2020). Other methods leverage approximations of the softmax to change the order of matrix multiplications, resulting in lower computational complexity (Katharopoulos et al., 2020; Choromanski et al., 2021; Schlag et al., 2021; Qin et al., 2022). Related approaches along this direction propose kernel functions that require additional parameters (Choromanski et al., 2021; Wang et al., 2020). Finally, Dao et al. (2022) proposed improvements in GPU memory access to optimize and accelerate the MHA computation.
Memory Efficiency Lan et al. (2020a) introduced a method for sharing the projection parameters for queries, keys and values across transformer layers. Furthermore, Kitaev et al. (2020) proposed sharing the projection matrix between keys and values within each layer. Other methods use a multi-query attention approach that shares projection weights for keys and values across different heads (Shazeer, 2019; Chowdhery et al., 2022; Ainslie et al., 2023), while Yan et al. (2021) directly treat the input hidden states as both keys and values. In a different direction, Lee-Thorp et al. (2022) proposed replacing the attention blocks with faster token-mixing blocks (e.g. linear or Fourier transformations) consisting of few or no parameters. However, these approaches tend to yield lower predictive performance compared to MHA.

Multiple Head Embeddings Attention
Inspired by the absolute position embeddings (Vaswani et al., 2017; Devlin et al., 2019) used for distinguishing the representation of the same token in different contexts, we propose Multiple Head Embeddings (MHE) attention. MHE uses a shared 'seed' projection matrix that is subsequently combined with distinct head embeddings to generate multiple attention heads.

Multi-head Attention (MHA)
We first begin by formally defining MHA. MHA consists of different projection matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_m \times d_h}$ (where $d_m$ is the dimension of the input representation and $d_h$ is the dimension of each of the $n$ attention heads) for queries ($Q$), keys ($K$) and values ($V$) per head $i$, i.e. $3 \times n$ matrices in total. Given an input sequence $X$, each head is computed as follows:

$$H_i = \mathrm{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^\top}{\sqrt{d_h}}\right) XW_i^V$$

and the head outputs are concatenated. Note that we use scaled dot-product attention, but our method can be used with any other attention mechanism.
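For illustration, a minimal PyTorch sketch of this module is shown below (not the exact implementation used in our experiments). It instantiates the $3 \times n$ projection matrices explicitly and loops over heads for clarity; the output pooling projection is omitted.

```python
import math
import torch
import torch.nn as nn

class VanillaMHA(nn.Module):
    """Vanilla MHA: one (d_m x d_h) projection per head for each of Q, K, V (3*n matrices)."""
    def __init__(self, d_m: int, n_heads: int, d_h: int):
        super().__init__()
        self.W_Q = nn.ModuleList(nn.Linear(d_m, d_h, bias=False) for _ in range(n_heads))
        self.W_K = nn.ModuleList(nn.Linear(d_m, d_h, bias=False) for _ in range(n_heads))
        self.W_V = nn.ModuleList(nn.Linear(d_m, d_h, bias=False) for _ in range(n_heads))
        self.d_h = d_h

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_m); loop over heads for clarity rather than speed.
        heads = []
        for w_q, w_k, w_v in zip(self.W_Q, self.W_K, self.W_V):
            q, k, v = w_q(x), w_k(x), w_v(x)                    # (batch, seq, d_h)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_h)
            heads.append(torch.softmax(scores, dim=-1) @ v)     # (batch, seq, d_h)
        return torch.cat(heads, dim=-1)                         # (batch, seq, n*d_h)
```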

Seed Projection Matrix
Unlike MHA, which uses different projection matrices per head, MHE attention employs only a single projection matrix for each of the queries, keys and values, $W^Q, W^K, W^V \in \mathbb{R}^{d_m \times d_h}$. These matrices are shared across all attention heads. We obtain query, key and value projections of the input sequence $X$ as follows:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Attention Head Embeddings
Using a single seed projection matrix for $Q$, $K$, $V$ is equivalent to a single-head attention (SHA) module. Therefore, we need a mechanism to transform the seed projections to obtain different attention heads. For this purpose, we represent each attention head $i$ by specific head embeddings $e_i^Q, e_i^K, e_i^V \in \mathbb{R}^{d_h}$, $i = 1, \dots, n$, for queries, keys and values respectively. These embeddings have a substantially smaller memory footprint than using different projection matrices per head. The contextualized representation $H_i$ of the entire input sequence $X$ for head $i$ is computed as follows:

$$H_i = \mathrm{softmax}\!\left(\frac{\psi(Q, e_i^Q)\,\psi(K, e_i^K)^\top}{\sqrt{d_h}}\right)\psi(V, e_i^V)$$

where $\psi(\cdot)$ is a function that modifies the query, key and value matrices with a corresponding head embedding $e_i$.

Modifying Queries, Keys and Values with Head Embeddings
We propose two MHE variants: one adds and the other multiplies the head embeddings with the projected queries, keys and values.

MHE-ADD: Motivated by the absolute position embeddings (Devlin et al., 2019), MHE-ADD uses addition as the modification function above, i.e. $\psi(A, b) := A + b$, where $A \in \{Q, K, V\}$ and $b \in \{e_i^Q, e_i^K, e_i^V\}$ respectively.

MHE-MUL: Likewise, motivated by the rotary position embeddings (Su et al., 2021), MHE-MUL employs element-wise multiplication, i.e. $\psi(A, b) := A \odot (b + 1)$, where $\odot$ denotes the Hadamard product. Figure 2 shows an overview of the MHE mechanism compared to MHA.
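A minimal PyTorch sketch covering both variants is shown below (again an illustration, not the exact implementation used in our experiments; the zero initialization of the head embeddings is an assumption, under which every head initially coincides with the shared single-head projection).

```python
import math
import torch
import torch.nn as nn

class MHEAttention(nn.Module):
    """MHE: one shared projection per Q/K/V plus 3*n head embeddings (3nd extra parameters)."""
    def __init__(self, d_m: int, n_heads: int, d_h: int, variant: str = "mul"):
        super().__init__()
        # Single 'seed' projections shared by all heads, as in SHA.
        self.W_Q = nn.Linear(d_m, d_h, bias=False)
        self.W_K = nn.Linear(d_m, d_h, bias=False)
        self.W_V = nn.Linear(d_m, d_h, bias=False)
        # Head embeddings; zero init (an assumption) makes every head start as the SHA projection.
        self.e_Q = nn.Parameter(torch.zeros(n_heads, d_h))
        self.e_K = nn.Parameter(torch.zeros(n_heads, d_h))
        self.e_V = nn.Parameter(torch.zeros(n_heads, d_h))
        self.variant, self.n_heads, self.d_h = variant, n_heads, d_h

    def psi(self, a: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # a: (batch, seq, d_h); e: (d_h,), broadcast over batch and sequence.
        if self.variant == "add":
            return a + e              # MHE-ADD: psi(A, b) = A + b
        return a * (e + 1.0)          # MHE-MUL: psi(A, b) = A Hadamard (b + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.W_Q(x), self.W_K(x), self.W_V(x)  # computed once, shared by all heads
        heads = []
        for i in range(self.n_heads):
            q_i = self.psi(q, self.e_Q[i])
            k_i = self.psi(k, self.e_K[i])
            v_i = self.psi(v, self.e_V[i])
            scores = q_i @ k_i.transpose(-2, -1) / math.sqrt(self.d_h)
            heads.append(torch.softmax(scores, dim=-1) @ v_i)
        return torch.cat(heads, dim=-1)
```

The only parameters beyond the shared SHA projections are the three $n \times d_h$ embedding tables, i.e. $3nd$ in total.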

Attention Mechanisms
We compare our MHE attention with the following attention mechanisms:
• Multi-head Attention (MHA): This is the original multi-head attention mechanism (Vaswani et al., 2017; Devlin et al., 2019).
• Single-head Attention (SHA): Similar to MHA but using only one attention head.
• EL-ATT: Introduced by Yan et al. (2021), this attention variant completely eliminates the projection matrices for all keys and values.
• MQA: Introduced by Shazeer (2019), this approach uses shared projection matrices for keys and values across all attention heads. Note that different projection matrices are used for queries across heads.
• SKV: Introduced by Kitaev et al. (2020), this attention variant enforces keys and values to share the same projection matrix within each attention module.
Encoder-only For GLUE, SUPERGLUE, SQUAD V1.1 and SQUAD V2.0, we use a BERT-base architecture. This consists of 12 transformer layers, an embedding size of 768, a hidden state dimension of 768, 12 attention heads and a maximum sequence length of 512.

Decoder-only We also test a decoder-only model using the GPT2-base architecture on WIKITEXT-103, PENN TREEBANK and GLUE. GPT2-base consists of 12 transformer layers, an embedding size of 768, a hidden state dimension of 768, 12 attention heads and a maximum sequence length of 512.
Encoder-decoder For WMT-14, we train an encoder-decoder transformer from scratch. It consists of 12 layers (6 each for the encoder and decoder), an embedding size of 512, a hidden state dimension of 512, 8 attention heads and a maximum sequence length of 100.
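For reference, the three configurations can be summarized as follows:

```python
# Summary of the reported architecture settings (dictionary keys are illustrative).
ARCHITECTURES = {
    "encoder-only (BERT-base)": dict(layers=12, emb_size=768, hidden=768, heads=12, max_seq_len=512),
    "decoder-only (GPT2-base)": dict(layers=12, emb_size=768, hidden=768, heads=12, max_seq_len=512),
    "encoder-decoder (WMT-14)": dict(layers=12, emb_size=512, hidden=512, heads=8, max_seq_len=100),
}
```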
We set the number of attention heads to 1 for all SHA models. Experimenting with larger models and different numbers of attention heads is out of the scope of our paper and left for future work, due to limited access to computing resources.

Implementation Details
Pre-training We pre-train all models on English Wikipedia and BookCorpus (Zhu et al., 2015) from HuggingFace (Lhoest et al., 2021) for up to 1M steps with a batch size of 128. We use masked language modelling as the pre-training objective. For all models, we use a 30K WordPiece vocabulary (Devlin et al., 2019).

Fine-tuning and Training
For GLUE, SUPERGLUE, SQUAD V1.1 and SQUAD V2.0, we fine-tune all pre-trained models for up to 20 epochs with early stopping, fixing the batch size to 32. For each task, we use five different seeds and report the average.
We train the encoder-decoder model from scratch on the training set of the WMT-14 English-to-German machine translation dataset for up to 100K steps with a batch size of 256. WMT-14 contains 4.5M sentence pairs; we evaluate on its test set. We train the tokenizer using byte-pair encoding (Sennrich et al., 2016) with 37K merge operations on the training set, with the source and target languages sharing the vocabulary. We use one random seed and report the average over the last five epochs. We optimize all models using AdamW (Loshchilov and Hutter, 2019).
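A sketch of the tokenizer training step using the HuggingFace tokenizers library is shown below. It approximates the described setup: BpeTrainer is parameterized by a target vocabulary size rather than a number of merge operations, and the file paths and special tokens are placeholders.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# One tokenizer trained on both sides, so source and target share the vocabulary.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=37000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(["train.en", "train.de"], trainer=trainer)  # placeholder file paths
tokenizer.save("wmt14-shared-bpe.json")
```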
Hyperparameters Hyperparameter selection details are in Appendix B.
Hardware For pre-training, we use four NVIDIA Tesla A100 GPUs; for fine-tuning on downstream tasks, we use one.

Predictive Performance Evaluation
For GLUE, SUPERGLUE, SQUAD V1.1 and SQUAD V2.0, we use the official metric of each task (see Appendix A for details on the metrics for each task). We report F1 score for SQUAD V1.1 and SQUAD V2.0. We use BLEU to report performance on the WMT-14 English-to-German machine translation task, and perplexity (PPL) to report generative performance on WIKITEXT-103 and PENN TREEBANK, fixing the stride length to 256.
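For clarity, a sketch of strided perplexity evaluation is given below, assuming a HuggingFace-style causal language model that averages the loss over label positions not set to -100 (the exact evaluation script is not specified here).

```python
import math
import torch

@torch.no_grad()
def strided_perplexity(model, ids: torch.Tensor, max_len: int = 512, stride: int = 256) -> float:
    """Sliding-window PPL; each window is scored only on tokens not seen by a previous window."""
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    seq_len = ids.size(1)
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end                 # new tokens to score in this window
        window = ids[:, begin:end]
        targets = window.clone()
        targets[:, :-trg_len] = -100             # mask context-only positions from the loss
        loss = model(window, labels=targets).loss
        nll_sum += loss.item() * trg_len         # approximate: treats the mean loss as exact
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)
```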

Memory Efficiency Evaluation
We use the following metrics to measure and compare the predictive performance retention and memory efficiency of MHE and the baselines (both sketched in code below):
• Performance Retention Ratio (PRR): The ratio between the predictive performance of each attention mechanism and the MHA upper-bound baseline performance (the higher the better).
• Performance Elasticity of Parameters (PEoP): The relative performance gain per relative increase in attention parameters, using SHA as the reference model (the higher the better).
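The following sketch illustrates both metrics. PRR follows directly from the definition above; for PEoP, the exact formula encoded here is one plausible reading of the definition, which also explains why PEoP is undefined for SHA itself.

```python
def prr(score: float, mha_score: float) -> float:
    """Performance Retention Ratio (%) relative to the MHA upper bound.
    For perplexity (lower is better) the ratio would need to be inverted."""
    return 100.0 * score / mha_score

def peop(score: float, sha_score: float, n_params: int, sha_params: int) -> float:
    """Assumed reading of Performance Elasticity of Parameters: relative
    performance gain over SHA divided by relative parameter increase over SHA."""
    rel_gain = (score - sha_score) / sha_score
    rel_cost = (n_params - sha_params) / sha_params
    return rel_gain / rel_cost
```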

Predictive Performance Comparison
Table 1 presents results on GLUE, SUPERGLUE, SQUAD V1.1 and SQUAD V2.0 for our MHE variants and all baselines. We first observe that the performance of both MHE-ADD and MHE-MUL is comparable to vanilla MHA on the two text classification benchmarks (80.4 and 80.6 vs. 81.9 on average GLUE, and 69.1 and 69.6 vs. 70.5 on average SUPERGLUE), with high performance retention ratios (PRR) between 97.9% and 98.7%. On the question answering tasks SQUAD V1.1 and SQUAD V2.0, both MHE variants are also competitive, with PRRs higher than 93%. Similar results are observed on the WMT-14 English-to-German machine translation task for the encoder-decoder transformer. According to Table 3, MHE-ADD and MHE-MUL achieve BLEU scores of 23.0 and 23.6, respectively. The performance of MHE-MUL is only marginally lower than that of MHA (24.8), while the model is substantially smaller.
Consistent results for the decoder-only transformer are shown in Table 2. The PRRs for MHE-ADD and MHE-MUL on GLUE remain high (i.e. 97.8% and 99.0%). Using the intrinsic metrics for evaluation, MHE-MUL yields perplexities of 53.8 and 50.7 compared to 43.0 and 44.3 for MHA on WIKITEXT-103 and PENN TREEBANK respectively, indicating a stable PRR higher than 74.9%.
In all tasks, MHE consistently outperforms SHA by a large margin with only 0.03M extra parameters (i.e. improvements ranging from 0.6 to 17.4). For example, for the MHE-MUL variant: 69.6 vs. 67.1 on SUPERGLUE, 72.3 vs. 67.6 on SQUAD V2.0, 23.6 vs. 22.5 on WMT-14, and 53.8 vs. 62.0 perplexity (lower is better) on WIKITEXT-103. We also note that the MQA and SKV attention mechanisms generally perform better than MHE; however, they are 1.7 and 2.4 times larger, i.e. 15.34M and 21.23M vs. 8.88M parameters. It is worth noting that MHE-MUL outperforms EL-ATT on three out of five benchmarks, despite having nearly half the parameters in the attention module.

Memory Efficiency Comparison
Our results so far indicate that predictive performance increases with the number of attention parameters, which is expected. Next, we inspect how efficiently different attention mechanisms utilize their parameters (for a detailed report on the memory usage of different attention mechanisms, see Appendix C). Tables 1 and 3 show how parameter efficient our two MHE attention variants and all baselines are, measured in PEoP. Note that PEoP scores cannot be computed for SHA, as it is used as the reference model. We also report PRR using MHA as the baseline for completeness; however, this metric does not take model size into account.
We first observe in Table 1 that both MHE-ADD and MHE-MUL achieve the highest PEoP scores on the two natural language understanding benchmarks (4.92 and 5.53 on GLUE, 9.44 and 12.07 on SUPERGLUE) and the two question answering tasks (4.65 and 13.19 on SQUAD V1.1, 19.88 and 22.25 on SQUAD V2.0). In contrast, vanilla MHA results in the lowest PEoP scores among all models as expected, ranging from 0.02 to 0.06, indicating its memory inefficiency.
The PEoP scores of the more lightweight EL-ATT and SKV are similar to that of MHA (0.02) on average GLUE, barely 4‰ of that of MHE, indicating that they are far more memory-inefficient than MHE.
Similar findings are observed on WMT-14 for the encoder-decoder models in Table 3. MHE-ADD and MHE-MUL achieve PEoP scores of 20.0 and 27.9, respectively. In contrast, the PEoP scores of MHA, EL-ATT, MQA and SKV are close to zero (barely 0.1). This means that investing more parameters into their attention modules does not bring proportional benefits in predictive performance. Even for SKV, which is half the size of MHA and achieves a high PRR, increasing the number of parameters by 1% over SHA improves the BLEU score by a negligible 0.1%. With the same relative parameter increase, our MHE-MUL improves the BLEU score by 11.0%, a rate of return 110 times larger than that of SKV. Leveraging head embeddings thus adds only a negligible number of parameters while efficiently improving predictive performance.
We further observe that MHE-ADD and MHE-MUL are architecture-agnostic, obtaining similar memory efficiency for the decoder-only model in Table 2. Both MHE-ADD and MHE-MUL achieve the highest PEoP scores on the two language modelling benchmarks (41.29 and 42.32 on WIKITEXT-103, and 60.15 and 81.76 on PENN TREEBANK). In all tasks, MHE consistently outperforms MHA by orders of magnitude in parameter efficiency. We also note that EL-ATT, MQA and SKV only reach PEoP scores of the same magnitude as MHA. This highlights the superior parameter utilization of the MHE attention variants, which achieve state-of-the-art memory efficiency.

Theoretical Memory Complexity
Table 4 presents the theoretical memory complexity and the total number of parameters of our two MHE variants and the baseline attention mechanisms in a single transformer sublayer. First, we see that the memory complexity of MHA and the other parameter-efficient variants (EL-ATT, MQA and SKV) is quadratic in the number of attention heads, while our two MHE variants are the only ones whose complexity is linear in the number of attention heads, similar to SHA.
Taking a closer look at the rightmost column in Table 4, we observe that the number of extra parameters of all attention variants relative to SHA grows quadratically with both the number of attention heads $n$ and the head dimension $d$, except for our two MHE variants. MHE only requires a relatively small fraction of additional parameters compared to SHA.
Table 4: Memory complexity in terms of the number of parameters in each attention sublayer, with the dimension of attention heads fixed to $d$; $n$ denotes the number of attention heads. For simplicity, the dimension of the hidden states $d_m$ is set to $nd$. The final projection for pooling attention heads is excluded.
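These counts can be written down directly, as in the sketch below. The SHA, MHA and MHE expressions follow the formulas stated in this paper, while the EL-ATT, MQA and SKV expressions are inferred from the variant descriptions and the GPT-3-scale figures reported in the next subsection.

```python
def attn_params(variant: str, n: int, d: int) -> int:
    """Parameters per attention sublayer with d_m = n*d; output projection excluded."""
    return {
        "SHA":    3 * n * d * d,                  # three shared (nd x d) projections
        "MHA":    3 * n * n * d * d,              # three (nd x d) projections per head
        "EL-ATT": n * n * d * d,                  # per-head queries only; K/V projections eliminated
        "MQA":    n * n * d * d + 2 * n * d * d,  # per-head queries + one shared K and one shared V
        "SKV":    2 * n * n * d * d,              # per-head queries + per-head shared K=V projection
        "MHE":    3 * n * d * d + 3 * n * d,      # shared projections + 3n head embeddings
    }[variant]
```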

Scaling the Number of Attention Parameters
Delving deeper into the effect of scaling on the memory footprint, Figure 3 shows the total number of parameters needed for a single attention module (e.g. in an encoder layer). We fix the dimension of attention heads to 64, as commonly used by BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). In general, we note that the number of parameters in MHA can exceed 200M with 128 attention heads, while SKV, MQA and EL-ATT require 2/3, 1/3 and 1/3 of that number respectively. In contrast, MHE accounts for only 1% of the MHA parameters. Moreover, Figure 4 presents the total number of parameters required across attention variants when stacking 12, 24 and 48 layers with 32 and 64 attention heads respectively, again fixing the head dimension to 64. We observe that with 64 attention heads, MHA with 24 layers already occupies more than 1B parameters, while EL-ATT and MQA reach 0.8B parameters with 48 layers and SKV reaches 0.8B with 24 layers. In contrast, the total number of parameters in MHE attention does not exceed 0.1B even when scaling to 48 layers with 64 attention heads. It is also clear that for MHA, EL-ATT, MQA and SKV, scaling to 48 layers with 32 attention heads requires a comparable number of parameters to 12 layers with 64 attention heads. This indicates that, under a tight memory budget, developers of large models must choose between doubling the number of attention heads and cutting the number of layers to a quarter. MHE does not suffer from this trade-off.
Further, we project these estimates onto the popular GPT-3 model (Brown et al., 2020), a decoder-only model with 96 decoder layers, 96 attention heads per layer, and a head dimension of 128. The vanilla multi-head attention modules require a massive 43.48B parameters. Using MHE attention, this number can be reduced to 0.46B parameters, i.e. a reduction of approximately 98.9%. Compared to the other parameter-efficient attention variants, EL-ATT (14.50B parameters), MQA (14.80B parameters) and SKV (28.99B parameters), it becomes evident that MHE offers far better memory efficiency, making it a compelling alternative for memory-constrained scenarios. See Appendix D for a detailed study on the robustness of MHE to model size changes (i.e. scaling).
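Using the attn_params helper sketched above, these estimates can be reproduced as follows:

```python
layers, n, d = 96, 96, 128  # GPT-3: 96 decoder layers, 96 heads per layer, head dimension 128
for variant in ("MHA", "SKV", "MQA", "EL-ATT", "MHE"):
    total = layers * attn_params(variant, n, d)
    print(f"{variant:>7}: {total / 1e9:.2f}B")
# Prints roughly: MHA 43.49B, SKV 28.99B, MQA 14.80B, EL-ATT 14.50B, MHE 0.46B,
# matching the figures above up to rounding.
```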

Discussion
MHA enables the model to attend to information from different representation subspaces at different positions (Vaswani et al., 2017). It uses distinct projection matrices for each attention head and integrates the information from these different representation subspaces. However, Vaswani et al. (2017) did not explore different methods for performing the per-head space transformations.
Previous work has pointed out that overparameterized models may have a low intrinsic dimension; therefore, transforming the projection matrices into smaller low-rank ones usually does not severely harm predictive performance (Li et al., 2018; Aghajanyan et al., 2020). Meanwhile, classic MHA does not impose any orthogonality constraints on these subspaces during pre-training and fine-tuning. The column vectors of the projection matrices can be highly collinear, i.e. the projection matrices can be rank-deficient. As a result, the inner workings of MHA can be understood as simply introducing levels of variation to the encoded representation of the same token at the same position across different heads.
By mimicking position embeddings to represent different attention heads, our MHE approach achieves memory efficiency similar to SHA together with a high PRR relative to MHA.
On one hand, the addition operation in MHE-ADD, used for transforming the queries, keys and values, can be seen as a small distortion of the subspace obtained through projection, followed by a rotation. For a given input representation, the difference between the projected and injected (i.e. through head embedding addition) queries, keys and values is a constant vector across any pair of heads. On the other hand, MHE-MUL employs a multiplication operation, which distorts and reshapes the query, key and value subspaces more aggressively. The head embeddings in MHE-MUL act as scaling factors, stretching each dimension of the projected input representation. Thus, the difference between the queries, keys and values generated by different heads for the same input representation is a vector obtained by element-wise scaling of the projected input, and therefore depends on the specific input, unlike the constant vector in MHE-ADD.
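This distinction is easy to verify numerically; the self-contained snippet below (an illustration, not taken from our codebase) checks that the head-to-head offset is input-independent for MHE-ADD and input-dependent for MHE-MUL.

```python
import torch

torch.manual_seed(0)
d_h = 4
x_proj = torch.randn(3, d_h)                    # projected inputs XW for three tokens
e_i, e_j = torch.randn(d_h), torch.randn(d_h)   # embeddings of two different heads

# MHE-ADD: the head-to-head offset is the same constant vector for every token.
diff_add = (x_proj + e_i) - (x_proj + e_j)
assert torch.allclose(diff_add, (e_i - e_j).expand_as(diff_add))

# MHE-MUL: the offset rescales each dimension of the projected input,
# so it differs from token to token.
diff_mul = x_proj * (e_i + 1) - x_proj * (e_j + 1)
assert torch.allclose(diff_mul, x_proj * (e_i - e_j))
```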
Interestingly, our experimental results consistently show that the multiplication operation outperforms addition on the majority of benchmarks. This corroborates the findings of a previous empirical study by Su et al. (2021), which compared rotary position embeddings (loosely analogous to MHE-MUL) with absolute position embeddings (analogous to MHE-ADD).

Conclusions
We have proposed MHE attention, which employs a single shared projection matrix along with multiple head embeddings to simplify and reduce the memory footprint of MHA. Our experimental results demonstrate that MHE attention exhibits superior memory efficiency compared to other memory-efficient attention variants, while achieving a high predictive performance retention ratio relative to MHA on various downstream tasks. Compared to single-head attention, MHA requires $(3n^2-3n)d^2$ additional parameters for $n$ attention heads and head dimensionality $d$, while MHE requires only a negligible $3nd$. For future research, we plan to investigate scaling up MHE models and exploring their linguistic capabilities (Vulić et al., 2020; Koto et al., 2021).

Limitations
We experiment only with 'base' size models, without larger architectures, due to limited access to computational resources. Similarly, we did not experiment with large decoder-only architectures (Brown et al., 2020), which we leave for future work. We have not combined our MHE method with computationally efficient attention methods with linear complexity, such as Linformer (Wang et al., 2020). We expect that this would speed up the computation of MHE, but it is out of the scope of our paper.

A Reported Metrics for Each Task
We evaluate all models on GLUE (Wang et al., 2018), SUPERGLUE (Wang et al., 2019), SQUAD V1.1 (Rajpurkar et al., 2016) and SQUAD V2.0 (Rajpurkar et al., 2018). We report matched accuracy for MNLI, Matthews correlation for CoLA, Spearman correlation for STS, F1 score for QQP, CB, MultiRC and SQUAD, and accuracy for all other tasks. Table 5 and Table 6 present results on GLUE and SUPERGLUE respectively for our MHE models and all baselines with the encoder-only architecture. Tables 7 and 8 present the scores and performance elasticity of parameters (PEoP) across all models for each task in GLUE and SUPERGLUE. Table 9 presents results on GLUE for our MHE models and all baselines with the decoder-only architecture, and Table 10 presents the corresponding scores and PEoP for each task in GLUE.

B Hyperparameters
The hyperparameters used in pre-training are listed in Table 11, and those used in fine-tuning in Table 12.

C Memory Usage
To further illustrate the memory efficiency of our MHE models compared to the baselines, we take the BERT-base architecture (12 attention heads, each with a dimension of 64) as an example, measure the memory usage per attention block as in Section 2.1.1 of Smith et al. (2022), and report the memory usage saving ratio (%) during the attention computation in Table 13. The calculation assumes inputs with a batch size of 32, a hidden dimension of 768, a sequence length of 512 and fp16 mixed-precision training, using the following formulas:
• Memory(weights) = #params × (2 + 4) bytes;
• Memory(gradients) = #params × (2 + 4) bytes;
• Memory(Adam states) = #params × (4 + 4) bytes;
• Memory(activations) = batch size × sequence length × hidden dimension × 2 bytes.
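This accounting translates directly into code; a small helper (ours, with an illustrative name) applying the formulas as stated:

```python
def attention_block_memory_bytes(n_params: int, batch: int = 32, seq: int = 512,
                                 hidden: int = 768) -> int:
    """Training-time memory of one attention block under the fp16 mixed-precision accounting above."""
    weights     = n_params * (2 + 4)        # fp16 weights + fp32 master copy
    gradients   = n_params * (2 + 4)        # fp16 gradients + fp32 copy
    adam_states = n_params * (4 + 4)        # fp32 Adam momentum and variance
    activations = batch * seq * hidden * 2  # fp16 activations
    return weights + gradients + adam_states + activations
```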
From Table 13, we observe that the memory usage saving ratio of our proposed MHE is 2.75 times better than SKV, 1.50 times better than MQA and 1.37 times better than EL-ATT, indicating state-of-the-art memory-saving capability compared to all other parameter-efficient attention variants.

D Robustness to Scaling
We also conduct experiments to assess the effectiveness and robustness of our best variant, MHE-MUL, while scaling the model size.
Table 14 presents average accuracy on two text classification benchmarks (GLUE and SUPERGLUE) and perplexities on two language modelling benchmarks (WIKITEXT-103 and PENN TREEBANK), with their corresponding performance retention ratios (PRR), for MHA and MHE-MUL in both encoder-only and decoder-only architectures across different model sizes. For the encoder-only models, we observe that the PRR of MHE-MUL remains largely stable on GLUE (from 98.4% to 98.7%) and SUPERGLUE (from 98.7% to 96.2%) while scaling the attention blocks to 3.5 times more parameters. For the decoder-only models, the PRR on GLUE for MHE-MUL stabilizes at 97.9% (i.e. 1.1% lower) after scaling. Surprisingly, the PRR of MHE-MUL increases on WIKITEXT-103 (from 74.9% to 95.2%) and PENN TREEBANK (from 85.6% to 88.5%) when scaling to MEDIUM size.

Figure 1: Number of parameters for an attention sublayer with different numbers of attention heads, using multi-head attention (MHA) and our multi-head embedding attention (MHE). We fix the dimension of attention heads to 64, counting only the parameters for projecting queries, keys and values.

Figure 2: Multi-head attention (left) requires 3 × n projection matrices for queries, keys and values ($W^{Q,K,V}$), where n is the number of attention heads. Multi-head embedding attention (right) uses only three projection matrices and 3 × n head embeddings.


Figure 3: Number of parameters per attention sublayer while scaling the number of attention heads in different attention variants. We fix the dimension of attention heads to 64.

Figure 4: Total number of parameters in attention sublayers while scaling the number of attention layers to 12, 24 and 48 with 32 and 64 attention heads respectively. We fix the dimension of attention heads to 64.

Table 1: Results of the encoder-only architecture on the GLUE, SUPERGLUE, SQUAD V1.1 and SQUAD V2.0 dev sets with performance retention ratio (PRR) and performance elasticity of parameters (PEoP) over five runs. Bold values denote the best performing method on each benchmark.

Table 2: Results of the decoder-only architecture on the GLUE dev sets and the WIKITEXT-103 and PENN TREEBANK test sets with performance retention ratio (PRR) and performance elasticity of parameters (PEoP) over five runs. Bold values denote the best performing method on each benchmark.


Table 5: Results for encoder-only models on the GLUE dev sets with standard deviations over five runs in parentheses. Bold values denote the best performing method on each task.

Table 6: Results for encoder-only models on the SUPERGLUE dev sets with standard deviations over five runs in parentheses. Bold values denote the best performing method on each task.

Table 7: Detailed average scores and performance elasticity of parameters (in parentheses) on GLUE for the MHE models and the baselines with the encoder-only architecture using MLM as the pre-training objective. Underlined values denote the best performing method and bold values denote the method with the best PEoP on each task.

Table 8: Detailed average scores and performance elasticity of parameters (in parentheses) on SUPERGLUE for the MHE models and the baselines with the encoder-only architecture using MLM as the pre-training objective. Underlined values denote the best performing method and bold values denote the method with the best PEoP on each task.

Table 9: Results for decoder-only models on the GLUE dev sets with standard deviations over five runs in parentheses. Bold values denote the best performing method on each task.