Interpreting and Exploiting Functional Specialization in Multi-Head Attention under Multi-task Learning

Transformer-based models, despite achieving super-human performance on several downstream tasks, are often regarded as black boxes and used as a whole. It is still unclear what mechanisms they have learned, especially in their core module: multi-head attention. Inspired by functional specialization in the human brain, which helps to efficiently handle multiple tasks, this work attempts to figure out whether the multi-head attention module will evolve a similar function separation under multi-task training. If so, can this mechanism further improve model performance? To investigate these questions, we introduce an interpreting method to quantify the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method to increase functional specialization and mitigate negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models demonstrate that multi-head attention does evolve a functional specialization phenomenon after multi-task training, and that its degree is affected by the similarity of tasks. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.


Introduction
The Transformer, built on the multi-head attention module, has been the dominant model for downstream applications due to its impressive results (Devlin et al., 2019; Brown et al., 2020; Dosovitskiy et al., 2021). However, it is still utilized as a whole black-box model, and little is known about the contribution of each sub-module to the final prediction. Meanwhile, although controversy still exists, there is overwhelming evidence supporting the idea of functional specialization in the human brain (Finger, 2001; Kanwisher, 2010). Such a functional specialization mechanism makes it easier for the human brain to handle multiple tasks and solve new problems: it can reuse existing resources while evolving task-specific regions, avoiding the huge cost of redesigning.
Considering the benefits of functional specialization to human learning ability, it is interesting to explore whether a transformer model, especially its central module, multi-head attention, would evolve a similar mechanism under multi-task training. If so, which factors affect the degree of functional specialization in the multi-head attention module? And how can this phenomenon be exploited to improve the generalization ability of Transformer-based models?
To investigate these questions, we first propose a method, called Important Attention-head Pruning (IAP), to quantify the degree of functional specialization in the multi-head attention of Transformer-based models. IAP first calculates the importance score of each attention head on different tasks, then prunes the most important heads for each task to determine their impact on task performance. We apply our method to five different tasks with seven pre-trained transformers. Results show that the multi-head attention module evolves distinct functional specialization phenomena across different sizes of BERT and different pre-training methods. Further quantitative analysis indicates a negative correlation between task similarity and the degree of functional specialization.
Moreover, we propose a multi-task learning method, namely Important Attention-head Training (IAT), to promote the segregation of functions in the multi-head attention module by training only the most important attention heads for each task. Experimental results on the GLUE dataset demonstrate that our method alleviates negative transfer among tasks and improves the performance of Transformer-based models on both multi-task learning and transfer learning without additional parameters.
To summarize, our main contributions are twofold: • We propose an interpretation method called IAP and find that a functional specialization phenomenon evolves in multi-head attention after multi-task learning. Furthermore, empirical quantitative experiments show that this phenomenon is influenced by the similarity between tasks: the more similar the tasks, the weaker the functional specialization.
• We propose an exploiting method called IAT to promote the degree of functional specialization. Experiments on multi-task learning and transfer learning validate that IAT improves both the performance and the generalization ability of multi-task learning models without adding any parameters.
Related Work

Interpreting Neural Networks
Interpreting the attention module Analogous to visual attention, the distribution of attention weights over the input is often used to interpret the final decision of an attention-based model (Clark et al., 2019; Vig and Belinkov, 2019). Therefore, a lot of work has studied the interpretability of attention distributions (Jain and Wallace, 2019; Serrano and Smith, 2019; Jacovi and Goldberg, 2020) or designed better explanation methods (Brunner et al., 2020; Kobayashi et al., 2020; Bai et al., 2021; Lu et al., 2021; Liu et al., 2022). Our work belongs to another line of study: investigating individual attention heads in the multi-head attention module. Voita et al. (2019) argued that there are redundant heads in the Transformer by pruning less important heads and analyzing the resulting performance, which was confirmed by Michel et al. (2019). Jo and Myaeng (2020) analyzed the linguistic properties of sentence representations from attention heads with ten linguistic probing tasks. Hao et al. (2021) retained only the important heads in BERT and constructed an attribution tree to interpret the information interactions inside the Transformer.
By pruning attention heads, we study the role they play in different tasks, rather than showing redundancy in the multi-head attention module (Michel et al., 2019).

Interpretation inspired by neuroscience
With a growing understanding of the functional specialization of the human brain, researchers have attempted to interpret deep learning models with brain activities in specialized regions (Wehbe et al., 2014; Toneva and Wehbe, 2019; Zhuang et al., 2021; Bakhtiari et al., 2021). For example, Toneva and Wehbe (2019) studied the representations of NLP models across layers by aligning them with two groups of brain areas in the language network.
Unlike these existing works, we investigate whether a brain-like functional specialization phenomenon occurs in NLP models, and how to exploit this phenomenon to improve the models.

Mitigating Negative Information Transfer in Multi-task Learning
By jointly learning multiple tasks, the performance of a model on a target task can be boosted through regularization or parameter sharing among tasks (Collobert et al., 2011; Ruder, 2017; Liu et al., 2019a). However, multi-task learning models in NLP often suffer from negative information transfer and are inferior to single-task learning ones (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017).
Our method subdivides the shared parameters into task-important modules to mitigate negative transfer among tasks, which differs from previous sampling-based or task-specific-adapter methods (Wu et al., 2020; Pilault et al., 2021). We only need to store a mask variable for each attention head rather than all parameters during training (Sun et al., 2020; Lin et al., 2021; Xie et al., 2021; Liang et al., 2021), which significantly reduces memory costs.

Multi-Head Attention Module
The Transformer (Vaswani et al., 2017) extended the single-head attention function to the Multi-Head Attention (MHA) module, which aims at capturing information from different representation subspaces in parallel. Given an input X ∈ R^{n×d}, this module linearly transforms it into n_h subspaces and then applies attention separately:

Att(Q, K, V) = softmax(QK^T / √d_k) V, (1)

where Q, K ∈ R^{n×d_k} and V ∈ R^{n×d_v}. The outputs of all heads are concatenated and linearly transformed into the output space of this module:

MHA(X) = Concat(head_1, ..., head_{n_h}) W^O. (2)

Head Importance Score

Michel et al. (2019) proposed an effective method to prune attention heads and evaluate the importance of each head for a task. In order to prune attention head h, they incorporated a mask variable ξ_h ∈ [0, 1] into the attention function:

Att_{ξ_h}(Q, K, V) = ξ_h · Att(Q, K, V), (3)

and set it to zero. When ξ_h equals 1, Equation (3) is the same as the vanilla attention (Eq. (1)). The importance score I_h of head h for task T_i is approximated by the expected sensitivity of the loss function to the mask variable ξ_h:

I_h^{(i)} = E_{(x,y)∼D^{(i)}} |∂L^{(i)}(x, y) / ∂ξ_h|, (4)

where D^{(i)} is the data distribution of task T_i and L^{(i)}(x, y) is the loss of task T_i on sample (x, y). Different from Michel et al. (2019), who prune the least important attention heads to prove their redundancy, this paper focuses on exploring the functional specialization phenomenon after training; thus we prune the most important heads for each task.
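The sensitivity in Eq. (4) can be sketched numerically. The snippet below is a minimal, dependency-free illustration: it approximates |∂L/∂ξ_h| by a finite difference on the mask variables, whereas the paper (following Michel et al., 2019) computes the gradient by backpropagation over real batches. The toy loss function and head contributions are invented for illustration.

```python
def head_importance(loss_fn, n_heads, eps=1e-6):
    """Approximate I_h = |dL/d(xi_h)| (Eq. 4) for each head by a finite
    difference on the mask variables xi_h of Eq. (3), evaluated at xi = 1."""
    base = loss_fn([1.0] * n_heads)
    scores = []
    for h in range(n_heads):
        xi = [1.0] * n_heads
        xi[h] -= eps                      # perturb only head h's mask
        scores.append(abs(base - loss_fn(xi)) / eps)
    return scores

# Toy stand-in for the masked model: each head h contributes a fixed
# amount a_h to the loss, so the importance scores should recover |a_h|.
a = [0.1, 2.0, 0.5]
toy_loss = lambda xi: sum(x * w for x, w in zip(xi, a))
scores = head_importance(toy_loss, n_heads=3)
# head 1 (contribution 2.0) comes out as the most important
```

In practice one accumulates |∂L/∂ξ_h| over batches sampled from D^{(i)}, which is equivalent to the single-evaluation sketch above only for this linear toy loss.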

Method
Figure 1 illustrates the general procedure of our methods. First, Transformer-based models are trained for multi-task learning, which may give rise to a segregation of functions in the multi-head attention module. Subsequently, the important attention heads are determined and pruned to quantify the functional specialization in multi-head attention (Section 4.1). Lastly, the roles of the important heads in each task are reinforced by important attention-head training to promote the degree of functional specialization (Section 4.2).

Interpreting: Important Attention-head Pruning
We introduce a two-step method, namely Important Attention-head Pruning (IAP), to quantify the degree of functional specialization in multi-head attention. First, the top α ∈ [0, 1] fraction of important heads H_i^α for task T_i, e.g., the ones circled by dashed lines in Figure 1(II), are found after dual-task or multi-task training according to their head importance scores. Specifically, we calculate the head importance score I_h, defined by Eq. (4), on training samples to approximate the contribution of head h to task T_i.
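The first IAP step, selecting H_i^α, amounts to ranking heads by their importance scores and keeping the top α fraction. A minimal sketch (the scores below are made up; real values come from Eq. (4)):

```python
import math

def top_alpha_heads(importance, alpha=0.3):
    """Return H^alpha: the top alpha-fraction of heads ranked by importance.
    `importance` maps a (layer, head) index to its score I_h."""
    k = max(1, math.ceil(alpha * len(importance)))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])

# Hypothetical scores for a toy 2-layer, 3-head model.
scores = {(0, 0): 0.9, (0, 1): 0.1, (0, 2): 0.4,
          (1, 0): 0.2, (1, 1): 0.8, (1, 2): 0.3}
top = top_alpha_heads(scores, alpha=0.3)   # ceil(0.3 * 6) = 2 heads
```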
Second, dissociation experiments are conducted to determine the degree of functional specialization in multi-head attention. Given a model f_θ after dual-task training on tasks T_A and T_B, for example, the relative performance on T_A after pruning the top α important attention heads for T_B, denoted by H_B^α, is calculated as follows:

P_A(H_B^α) = P(f_{θ∖H_B^α}(X_A), Y_A) / P(f_θ(X_A), Y_A), (5)

where P(·) is the performance metric used, e.g., accuracy, and (X_A, Y_A) are the test samples of task T_A. Then, we estimate the degree of functional specialization by the relative performance difference after the top α important heads for each task are pruned, called the dissociation score:

D_A(α) = P_A(H_B^α) − P_A(H_A^α),  D(α) = (D_A(α) + D_B(α)) / 2, (6)

where D_A(α) denotes the dissociation score of task T_A, and D(α) is the average dissociation score of this dual-task pair. Given an appropriate α, a larger dissociation score implies a higher degree of functional specialization.
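Once the intact and pruned performances are measured, the dual-task dissociation score (Section 4.1) reduces to a few lines. The accuracy numbers below are invented purely for illustration:

```python
def dissociation(perf_full, perf_pruned, task_a, task_b):
    """Dual-task dissociation scores (Sec. 4.1).

    perf_full[t]      : metric of the intact model f_theta on task t
    perf_pruned[t][s] : metric on task t after pruning the top-alpha heads of s
    """
    rel = lambda t, s: perf_pruned[t][s] / perf_full[t]   # relative performance
    d_a = rel(task_a, task_b) - rel(task_a, task_a)
    d_b = rel(task_b, task_a) - rel(task_b, task_b)
    return d_a, d_b, (d_a + d_b) / 2

# Invented accuracies: pruning a task's own top heads hurts it most.
perf_full = {'MNLI': 0.84, 'AG': 0.94}
perf_pruned = {'MNLI': {'MNLI': 0.60, 'AG': 0.80},
               'AG':   {'AG': 0.70,  'MNLI': 0.90}}
d_a, d_b, d_avg = dissociation(perf_full, perf_pruned, 'MNLI', 'AG')
# d_a, d_b > 0 and d_avg >= 10%: a distinct double dissociation
```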
Similarly, the dissociation score of task T_i under multi-task learning is measured via:

D_i(α) = (1 / (|T| − 1)) Σ_{j≠i} P_i(H_j^α) − P_i(H_i^α), (7)

where T is the set of jointly trained tasks. To clearly illustrate the functional specialization phenomenon, we summarize two representative cases under dual-task learning: • Double dissociation when D_A(α) > 0 and D_B(α) > 0. This is a significant indicator of functional specialization: each task requires a unique group of heads, which can be selectively masked. To eliminate accidental functional specialization, we argue that a distinct phenomenon occurs only if the average dissociation score is at least 10%, i.e., D(α) ≥ 10%, where 10% is chosen according to the definition of double dissociation in neuroscience (Shallice, 1988).
• Single dissociation when only one dissociation score is significantly positive, which suggests that functional specialization may arise only for that task.
The two dissociation scores may also both be negative, which arises from a wrong estimation of the important heads for each task; under a correct estimation and pruning, such cases reduce to double dissociation. In the remaining cases, e.g., when the dissociation scores of both tasks are relatively small, we argue that there is no functional specialization in the multi-head attention module: the influence on all tasks is almost identical no matter which group of heads is pruned.
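The case analysis above can be written down directly. The 10% threshold follows Shallice (1988) as quoted in the text; reusing it as the cutoff for calling a single score "significantly positive" is our assumption:

```python
def classify_dissociation(d_a, d_b, distinct=0.10):
    """Label a dual-task outcome from its dissociation scores (Sec. 4.1)."""
    if d_a > 0 and d_b > 0:
        if (d_a + d_b) / 2 >= distinct:
            return 'distinct double dissociation'   # D(alpha) >= 10%
        return 'double dissociation'
    if d_a < 0 and d_b < 0:
        # Both negative: the important heads were likely mis-estimated.
        return 're-evaluate head importance'
    if max(d_a, d_b) >= distinct:
        return 'single dissociation'                # specialization in one task only
    return 'no functional specialization'
```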

Exploiting: Important Attention-head Training
Motivated by the high degree of functional specialization in the human brain, it is interesting to investigate whether a higher degree of functional specialization could improve the performance of the model on multi-task learning or transfer learning.
To promote the degree of functional specialization in multi-head attention, we design a multi-task training method named Important Attention-head Training (IAT). Specifically, only the top α ∈ [0, 1] important attention heads for task T_i are tuned during the last δ ∈ [0, 1] fraction of the multi-task training process, while the parameters outside the multi-head attention module are trained as before. To achieve this, we introduce a mask variable M_i ∈ {0, 1}^{n_h} for task T_i, where 1 indicates that this attention head is fine-tuned for T_i. For example, in Figure 1(III), only the mask variables of heads circled by the solid blue line are set to 1 for T_n. When α = 1 or δ = 0, our method reduces to normal multi-task learning.
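A minimal sketch of the IAT bookkeeping: deciding when the restriction kicks in and zeroing the gradients of heads not selected by M_i. How per-head gradients are obtained from a real model is left out; the names and values are illustrative.

```python
def iat_active(step, total_steps, delta):
    """IAT restricts attention-head updates only in the last `delta`
    fraction of the multi-task training process."""
    return step >= (1 - delta) * total_steps

def mask_head_grads(head_grads, task_mask):
    """Apply the per-task mask M_i: heads with M_i[h] = 0 keep their
    parameters frozen on this task's batches, while everything outside
    the multi-head attention module is updated as usual."""
    return [g if m else 0.0 for g, m in zip(head_grads, task_mask)]

# M_i for a toy 4-head layer: only heads 1 and 2 are important for task i.
task_mask = [0, 1, 1, 0]
grads = mask_head_grads([0.7, -0.2, 0.4, 0.9], task_mask)   # -> [0.0, -0.2, 0.4, 0.0]
```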
In this way, we expect to consolidate the roles of the important heads for each task and facilitate the functional separation of multi-head attention.

Datasets
We select a topic classification dataset (Zhang et al., 2015), eight natural language understanding datasets of GLUE (Williams et al., 2018; Rajpurkar et al., 2016; Wang et al., 2019), and two datasets (Maas et al., 2011; Khot et al., 2018) for transfer learning in this study. To avoid extreme ratios of training samples between tasks, only the five large datasets from different tasks, each containing more than 10k training samples, are preserved in the dual-task and multi-task interpretation experiments. Like Karimi Mahabadi et al. (2021), SciTail and IMDB are used only in transfer learning. Statistics of all datasets used are shown in Table 1.

Models
As shown in Table 2, seven Pre-trained Transformer Models (PTMs) are investigated in this paper, covering GPT family models, different sizes of BERT, and different pre-training methods (Radford et al., 2018, 2019; Devlin et al., 2019; Liu et al., 2019b; Jiao et al., 2020; He et al., 2021). These models are all initialized from the HuggingFace transformers library (Wolf et al., 2019). Hyperparameters are reported in Appendix A.

Dual-task Learning We observe that the dissociation scores of models trained without a frozen encoder are all positive in dual-task learning, i.e., a double dissociation phenomenon appears in all task pairs (details are shown in Appendix B). As illustrated in Figure 2, BERT BASE shows a distinct functional specialization phenomenon (D(α) > 10%) in four dual-task learning tasks. Moreover, distinct functional specialization phenomena are also found in the other two sizes of BERT and in the GPT models. The other two base-size models, RoBERTa BASE and DeBERTaV3 BASE, show an even higher degree of functional specialization, with average dissociation scores over the ten dual-task pairs of 13.44% and 10.88%, respectively.
To rule out accidental functional specialization, we train another dual-task model using a frozen BERT BASE encoder for comparison. As shown in the fifth row of Figure 2, most of the dissociation scores are relatively small, and only one dual-task pair, "MNLI and AG", shows a mild functional specialization phenomenon (D(α) > 5%). The average dissociation score of these ten task pairs increases by 6.32% once we fine-tune the shared encoder.

Multi-task Learning
We further conduct multi-task learning experiments using all five tasks from the dual-task setting. In addition to all dissociation scores being positive, we find that the performance of a task decreases more when pruning the top 30% important attention heads of that task than when pruning those of other tasks (Table 3). This demonstrates that a functional specialization phenomenon has evolved after multi-task learning, i.e., there is a unique group of heads that is more important to one specific task. Otherwise, the influence on all tasks would be similar no matter whose most important attention heads were pruned.
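The multi-task dissociation score D_i(α) of Eq. (7) compares pruning the other tasks' head sets, averaged, against pruning the task's own. This averaged reading is our reconstruction of the equation, and the accuracy numbers are invented:

```python
def multitask_dissociation(perf_full, perf_pruned, task):
    """D_i(alpha) under multi-task learning: average relative performance on
    `task` when the other tasks' top heads are pruned, minus the relative
    performance when its own top heads are pruned."""
    rel = lambda t, s: perf_pruned[t][s] / perf_full[t]
    others = [s for s in perf_full if s != task]
    return sum(rel(task, s) for s in others) / len(others) - rel(task, task)

# Invented accuracies for three jointly trained tasks.
perf_full = {'MNLI': 0.84, 'QQP': 0.90, 'AG': 0.94}
perf_pruned = {'MNLI': {'MNLI': 0.55, 'QQP': 0.75, 'AG': 0.80},
               'QQP':  {'MNLI': 0.85, 'QQP': 0.60, 'AG': 0.86},
               'AG':   {'MNLI': 0.91, 'QQP': 0.92, 'AG': 0.72}}
d_mnli = multitask_dissociation(perf_full, perf_pruned, 'MNLI')
# positive: MNLI suffers most when its own important heads are pruned
```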
The absolute performance on the first three tasks (MNLI, QQP, and QNLI) suffers a drastic drop after pruning only 30% of the attention heads. For example, the lowest drop is 14.41%, on MNLI when the top 30% important heads for SST-2 are pruned, while the highest drop among AG and SST-2 is only 14.49% when pruning the same number of attention heads. This indicates that tasks taking two sequences as input, e.g., natural language inference and question answering, depend on the attention mechanism more than single-sequence tasks do, which is in line with the finding of Vashishth et al. (2019). See Appendix C for more details and analyses.

Task Similarity Affects Functional Specialization
After observing the functional specialization phenomenon in the multi-head attention module, it is interesting to study what affects this phenomenon. In this section, we empirically explore two factors: task similarity and input paradigm.

Task Similarity The task similarity metric Cognitive-Neural Mapping (CNM), which has been found less sensitive to the underlying models (Luo et al., 2022), is utilized to quantify the similarity of each task pair in this section. As shown in Figure 3, we observe a significant negative correlation between the average dissociation score of a task pair and the similarity between its tasks. In other words, the more similar the tasks are, the lower the average dissociation score, suggesting a weaker functional specialization of heads. For the other three task similarity metrics and the fitting results, refer to Appendix D, where this negative relationship also holds.
Input Paradigm There are two different input paradigms among these five tasks: sentence pair (MNLI, QNLI, and QQP) and single sentence (AG and SST-2). We notice in Figure 2 that the average dissociation score of task pairs with different input paradigms is higher than that of pairs sharing the same paradigm (BERT BASE: 12.654% > 3.016%). Thus, experiments are conducted to investigate the effect of the input paradigm on the degree of functional specialization in multi-head attention. Specifically, we construct a dataset named "AG-Pair" from the sentences of the AG dataset, where the task is to identify whether a pair of input sentences belongs to the same topic. AG-Pair contains the same number of samples as AG (120k), and each sample of AG occurs twice in AG-Pair. The generation method and statistics of AG-Pair are reported in Appendix E.
As shown in Table 4, there is no significant difference in dissociation scores between the "AG + QNLI" and "AG-Pair + QNLI" dual-task pairs, and the same holds for "AG + SST-2" and "AG-Pair + SST-2". We note that the absolute performance on AG-Pair drops drastically after pruning only 30% of attention heads, similar to the other tasks taking a pair of sentences as input (Table 3).
Based on the experimental results above, we conclude that task similarity plays a more important role than the input paradigm in the functional specialization of the multi-head attention module.

Improving Multi-Task Models by Training Important Attention Heads
Once the importance of attention heads for each task is determined, we should be able to consolidate their roles by fine-tuning only those heads. Thus, Important Attention-head Training (IAT) (Section 4.2) is applied to multi-task learning models on the 9 GLUE datasets and compared against vanilla multi-task learning. We observe that the degree of functional specialization in the multi-head attention module is improved by training the top important attention heads during the last part of multi-task learning (see Appendix F for details).
Table 5 reports a comparison of single-task fine-tuning models, multi-task learning models, and models using adapters on the GLUE test set. GPT and GPT-2 are not included due to their inferior performance on GLUE. With important attention-head training, the average performance of the five multi-task learning models increases by 0.76% on average over the vanilla multi-task learning baseline. These Transformer family models for multi-task learning even surpass their single-task fine-tuning counterparts, which consist of 9 task-specific models.
In most cases, multi-task learning models with IAT receive a performance gain on the four small datasets (CoLA, MRPC, RTE, and STS-B), among which the improvement on CoLA is the most significant (+3.6% on average). This comes from the alleviation of negative transfer on CoLA under multi-task learning. For example, compared with fine-tuning on CoLA alone (60.5%), the performance of BERT LARGE drops to 56.8% under multi-task learning, while it recovers to 60.0% after using IAT. The performance of multi-task learning models on two large datasets, QQP and SST-2, is also improved by our method. More results, including different sampling methods and performance on the GLUE development sets, are shown in Appendix F.
Few-shot Transfer Learning Furthermore, we investigate whether a multi-task learning model with a more specialized multi-head attention module is better at transfer learning. Table 6 presents few-shot transfer learning results using different amounts of training samples from SciTail (natural language inference) and IMDB (sentiment analysis). We find that a model initialized from a multi-task learning model using IAT achieves higher accuracy on the new task, especially when fewer samples are provided. IAT degrades to the multi-task learning method proposed by Liang et al. (2021) when δ = 1, which often obtains worse performance in multi-task learning and transfer learning (Ticket-Share in Tables 5 and 6). This may stem from the weak functional specialization in the original pre-trained models (e.g., the frozen BERT BASE encoder in Figure 2), which makes it harder to correctly determine the most important attention heads for each task at the beginning of multi-task training.
Ablation Study To take a deeper look into the improvements contributed by important attention-head training, we conduct an ablation study on the GLUE dev set using BERT BASE (Table 7). After pruning the least important 30% of heads, there is a performance gain on three tasks (MRPC, SST-2, and STS-B), in line with the previous finding that the Transformer can be improved by pruning some redundant attention heads (Michel et al., 2019).
Interestingly, multi-task models also benefit from training a random 30% of attention heads for each task, which may arise from the mitigation of gradient interference by subdividing the shared parameters. Compared with training a random 30% of attention heads, training the most important attention heads further improves the average performance and benefits more tasks, which is confirmed by the ablation study on few-shot transfer learning (Table 8).

Conclusions and Future Work
In this paper, we conduct extensive dissociation experiments and observe that a brain-like functional specialization phenomenon does evolve in multi-head attention after dual-task or multi-task learning. Furthermore, experimental results show that the performance and generalization ability of multi-task models can be improved by a multi-task training method based on functional specialization. This work, inspired by neuroscience findings, studies the interpretation and improvement of neural networks, and we hope it will promote more interdisciplinary work combining neuroscience and artificial intelligence.
In the future, we plan to investigate more neural network modules that may give rise to the functional specialization phenomenon under multi-task learning. Another direction is to design better methods for exploiting this phenomenon to further improve multi-task learning models.

Limitations
First, we conduct extensive experiments on multiple natural language understanding tasks only; multi-modal tasks could be investigated further.
In addition, only one approach is utilized to estimate the importance of each attention head, and the most important attention heads are pruned at once. Because of this choice, our results can be seen as a lower bound on the estimation of functional specialization in multi-head attention. We acknowledge that other methods might show higher dissociation scores, such as adopting other attention-head importance estimation methods (Hao et al., 2021; Li et al., 2021) or iterative pruning.
We note that the four similarity metrics used in this study are model-dependent, and recognize that results might differ for other Transformer-based models.
Lastly, our multi-task training method introduces two hyper-parameters, which may need extra tuning when adapted to other multi-task learning settings.

A.1 Dual-task and Multi-task Learning
To fine-tune the pre-trained models on dual-task or multi-task learning, we use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, and a learning rate of 2e-5. We also use a linear warm-up schedule with a warm-up proportion of 0.1. The number of epochs is empirically set to 5 for a fair comparison. The only exception is the distillation of TinyBERT, which contains intermediate-layer distillation and prediction-layer distillation; under the supervision of a fine-tuned BERT BASE, these distillation steps are performed for 2 and 3 epochs, respectively, without augmented data. Unless otherwise specified, the proportional sampling method is used during learning.
Similar to the differences in area between cortical regions, the best α for each task may differ in dissociation experiments. We acknowledge that higher dissociation scores could be obtained by fine-tuning α for each dual-task pair. For a fair comparison, α is empirically set to 30% in all dissociation experiments to show the extent of functional specialization in the multi-head attention module. All experiments are repeated with three random seeds and average results are reported.

A.2 Transfer Learning
Since only a small part of the training samples is used in the transfer learning experiments, we increase the number of training epochs to 20 and conduct a paired bootstrap statistical test over 30 random seeds (Dror et al., 2018).

B Dual-task Learning Experiments
In this section, we present the results of all multi-head attention based models investigated in the dual-task learning tasks.
As reported in Table 9, the dissociation scores of Transformer-based models in dual-task learning are all positive when the pre-trained encoder is fine-tuned, i.e., a double dissociation phenomenon appears in all task pairs. This further demonstrates that the functional specialization phenomenon does appear in the multi-head attention module after training on these dual-task learning tasks.

C Multi-task Learning Experiments on BERT BASE
We report more results of the multi-task learning experiments conducted in Section 6.2. The pair-wise dissociation scores are reported in Table 10.

Distribution of Heads Pruned
To gain more insight into functional specialization in multi-task learning, we compute the distribution across layers of the heads pruned for each task (Figure 4). The average number of attention heads pruned first increases and then decreases, with the turning point at the 4th layer.
The two layers with the greatest difference among tasks are the first layer (σ = 2.39) and the sixth layer (σ = 2.08) of BERT BASE after fine-tuning for 5 epochs on these five tasks.

Overlapping of Heads Pruned Table 11 reports the overlap of the attention heads pruned for different tasks. The proportion of overlapping pruned heads does not completely correspond to the dissociation score of each task (Table 10). For example, for the MNLI and AG tasks, the task with the highest overlap of pruned heads is also the one with the lowest dissociation score. However, the highest overlap of pruned heads for SST-2 corresponds to the second-highest dissociation score when combined with MNLI.
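The per-layer and overlap statistics behind Figure 4 and Table 11 can be sketched as follows. The normalization for "overlap" (intersection over set size, with equal-sized pruned sets) is our assumption, and the head indices are made up:

```python
from collections import Counter

def per_layer_counts(pruned_heads, n_layers):
    """Number of pruned heads per layer; `pruned_heads` holds (layer, head) pairs."""
    counts = Counter(layer for layer, _ in pruned_heads)
    return [counts.get(layer, 0) for layer in range(n_layers)]

def pruned_overlap(heads_a, heads_b):
    """Proportion of task A's pruned heads that are also pruned for task B."""
    a, b = set(heads_a), set(heads_b)
    return len(a & b) / len(a)

task_a = [(0, 1), (0, 2), (1, 0), (2, 3)]
task_b = [(0, 1), (1, 0), (1, 2), (2, 0)]
layers = per_layer_counts(task_a, n_layers=3)   # -> [2, 1, 1]
overlap = pruned_overlap(task_a, task_b)        # 2 of 4 heads shared -> 0.5
```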

D Task Similarity Metrics and Fitting Results
To verify the robustness of our finding in Section 6.2, the following four metrics are adopted to determine the similarity of each task pair:

Cognitive Representation Analytics (CRA) Inspired by Representational Similarity Analysis (RSA) in cognitive neuroscience (Kriegeskorte et al., 2008), CRA first calculates the Representation Dissimilarity Matrix (RDM) from the dissimilarity of sentence representations, then approximates the similarity between tasks by the similarity between the corresponding RDMs (Luo et al., 2022).

Cognitive-Neural Mapping (CNM) CNM calculates task similarity by mapping the sentence representations of fine-tuned models to fMRI data (Luo et al., 2022), recorded while 5 participants intently read 384 presented passages (Pereira et al., 2018). Instead of randomly selecting 25k fMRI voxels, the most informative 5k fMRI voxels of each participant are used to predict the similarity among tasks. Results with CNM are shown in Figure 3.
To sum up, we observe a negative correlation between the average dissociation score and the task similarity, no matter which task similarity metric is adopted.

E AG-Pair dataset
The AG-Pair dataset is built from the original AG's News dataset, which contains 120k training samples from four topics. Given a pair of news articles as input, the model has to predict whether they belong to the same topic (Same) or not (Different).
To generate this dataset, the samples of AG are iterated in random order, and each has an equal chance of being combined with a sample from the same topic or from the other three topics. Thus the numbers of training samples in the two classes are both 60k. Moreover, each news article of AG's News occurs exactly twice in the AG-Pair dataset to keep the word frequencies unchanged.
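A simplified sketch of the pairing procedure (the exactly-twice bookkeeping described above is omitted for brevity; topics and texts are toy values):

```python
import random

def build_pairs(samples, seed=0):
    """Pair each sample with a partner that has an equal chance of sharing its
    topic, labeling the pair 'Same' or 'Different' (cf. Appendix E). `samples`
    is a list of (text, topic) tuples; every topic needs at least two samples."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)                    # iterate samples in random order
    pairs = []
    for i in order:
        text, topic = samples[i]
        want_same = rng.random() < 0.5    # equal chance of same/different topic
        pool = [j for j in range(len(samples))
                if j != i and (samples[j][1] == topic) == want_same]
        j = rng.choice(pool)
        label = 'Same' if samples[j][1] == topic else 'Different'
        pairs.append((text, samples[j][0], label))
    return pairs

toy = [('a1', 'World'), ('a2', 'World'), ('b1', 'Sports'), ('b2', 'Sports')]
pairs = build_pairs(toy)
```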

F Other Experimental Results on GLUE
In this section, we report more results and analyses of the multi-task learning models on GLUE. Figure 8 illustrates that the average dissociation scores of the five Transformer-based models are all improved by IAT, as expected. In line with the finding of Stickland and Murray (2019), the annealed sampling method is better than the proportional sampling method for multi-task learning on GLUE. The sampling probability of task i in annealed sampling changes with the epoch e: p_i ∝ N_i^ε with ε = 1 − 0.8 · (e − 1)/(E − 1), where N_i is the number of samples in task i and E is the total number of epochs. In contrast, ε in proportional sampling is always equal to 1.
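The sampling schedule can be sketched as follows. The ε decay from 1 toward 0.2 is our reconstruction of the annealed schedule of Stickland and Murray (2019), and the dataset sizes are illustrative:

```python
def sampling_probs(sizes, epoch, total_epochs, annealed=True):
    """Task sampling probabilities p_i proportional to N_i^eps. Annealed
    sampling decays eps from 1 toward 0.2 across epochs; proportional
    sampling keeps eps = 1 throughout."""
    if annealed and total_epochs > 1:
        eps = 1 - 0.8 * (epoch - 1) / (total_epochs - 1)
    else:
        eps = 1.0
    weights = [n ** eps for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

sizes = [2_500, 100_000]                 # a small task and a large task
first = sampling_probs(sizes, epoch=1, total_epochs=5)   # same as proportional
last = sampling_probs(sizes, epoch=5, total_epochs=5)
# by the last epoch the small task is sampled relatively more often
```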
Table 12 shows the results of the five multi-task learning models using the proportional sampling method on the GLUE test set. These multi-task learning models with proportional sampling also perform better after using IAT (+0.68% on average), which is in line with the findings in Section 6.3 and further demonstrates the effectiveness of our method.
Additional experimental results on the GLUE development sets for all models tested in this paper are reported in Table 13. In most cases, the standard deviation of the average performance on the GLUE development set after using IAT is less than or equal to that of the baseline, which indicates the robustness of our method.

Figure 1 :
Figure 1: Illustration of how to quantify and improve the degree of functional specialization in multi-head attention for Transformer-based models. Only attention heads, which are our research target, are depicted in the model for simplicity. (I) Multi-task learning using Transformer-based models. (II) Quantifying the functional specialization phenomenon by determining and pruning the important heads for each task. (III) Improving the functional specialization phenomenon by fine-tuning only the important heads for each task in the last part of the multi-task learning process.

Figure 2: Average dissociation scores of different Transformer-based models (y-axis) after ten dual-task learning tasks (x-axis) with α = 30%. A larger dissociation score implies a higher degree of functional specialization in multi-head attention (Section 4.1). All dissociation scores are reported in Table 9. * indicates that the parameters of the BERT BASE encoder are frozen, i.e., only the output layers are fine-tuned.

Figure 3: The average dissociation score and similarity of each task-pair in multi-task learning.

Figure 4: The number of important heads pruned among the layers of BERT BASE after multi-task learning. The average number of heads pruned in one layer is 3.6 (α = 30%).

Figure 6: The average dissociation score and AHP similarity of each task pair in multi-task learning.

Figure 7: The average dissociation score and CRA similarity of each task pair in multi-task learning.

Figure 8: Average dissociation score of five multi-task learning models on GLUE dev set with α = 30%.

Figures 9 and 10 present the impact of the two hyperparameters of IAT, δ and α, on the average performance of BERT BASE. It is interesting to find that with a small δ and α (e.g., δ = 10% and α = 30%), BERT BASE using IAT can achieve good performance on the GLUE dev set. Therefore, we only consider a limited hyperparameter sweep for each multi-task learning model, with δ ∈ {0.05, 0.1, 0.15} and α ∈ {0.1, 0.2, 0.3}.
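The limited sweep above amounts to a small grid search. In the sketch below, `evaluate` is a hypothetical callback that trains one IAT run and returns its average GLUE dev score for a given (δ, α) setting:

```python
from itertools import product

def sweep_iat(evaluate, deltas=(0.05, 0.1, 0.15), alphas=(0.1, 0.2, 0.3)):
    """Evaluate every (delta, alpha) combination on the limited grid and
    return the best-scoring pair. `evaluate(delta, alpha)` is assumed to
    return a scalar score (higher is better)."""
    scores = {(d, a): evaluate(d, a) for d, a in product(deltas, alphas)}
    return max(scores, key=scores.get)
```

With 3 × 3 = 9 settings, the sweep stays cheap enough to repeat for each multi-task learning model.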

Figure 9: The average performance of BERT BASE on GLUE dev set using IAT with different δ (α = 50%).

Figure 10: The average performance of BERT BASE on GLUE dev set using IAT with different α (δ = 10%).


Table 1: Statistics of the datasets used. ⋆ denotes a dataset used in the dual-task and multi-task interpretation experiments.

Table 2: Statistics of the models used. #L = the number of layers, #A = the number of attention heads per layer.

Table 3: Performance (%) of the pruned and base model on each task using BERT BASE with α = 30%. T† denotes that the top-α important heads for this task are pruned. The lowest value is underlined.

Table 4: Comparison between different input paradigm combinations. The input of AG-Pair is a pair of sentences from AG's News, and the label is whether they belong to the same topic.

Table 5: GLUE test set results using the GLUE evaluation server. "ST" stands for the single-task fine-tuned model, whereas "MTL" denotes the multi-task learning model. The multi-task learning models we tested are not further fine-tuned on each task, so there is only one model for all tasks (1.0× in #Params). Results from: Devlin et al. (2019) 1, Stickland and Murray (2019) 2, Pilault et al. (2021) 3, Houlsby et al. (2019) 4. ‡ indicates our implementation result for a fair comparison. The highest performance in the last two conditions of each model is displayed in bold.

Table 6: Few-shot transfer learning results on development sets across 30 seeds (* indicates statistically significant improvements at the 5% level). All models use BERT BASE as the encoder and are initialized from their multi-task learning models on GLUE.

Table 7: Ablation study of different multi-task methods on the GLUE dev set with δ = 10%.

Table 8: Ablation study of 4-shot transfer learning using different multi-task learning models on GLUE.

Table 9: Results of the dual-task learning experiments under α = 30%. * indicates that the parameters of the BERT BASE encoder are frozen.

Table 10: D_A(α) between task pairs, calculated from the pruning results of multi-task learning with α = 30%. The highest dissociation score for each task A is displayed in bold, and the lowest is underlined.

Table 11: The percentage of overlap between the important heads pruned in multi-task learning under α = 30%. The highest overlap for each task A is displayed in bold, and the lowest is underlined.