The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders

Multi-task learning (MTL) with transformer encoders has emerged as a powerful technique for improving both the accuracy and efficiency of closely-related tasks, yet it remains an open question whether it performs as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and show its deficiency compared to single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads gets claimed by most tasks during MTL, and the tasks interfere with one another in fine-tuning those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate through label analysis how attention heads are transformed across the five tasks during MTL.


Introduction
Transformer encoders (TEs) have established recent state-of-the-art results on many core NLP tasks (He and Choi, 2019; Yu et al., 2020; Zhang et al., 2020). However, their architectures can be viewed as "over-parameterized" since downstream tasks may not need all of those parameters, which incurs computational overhead. One promising approach to mitigate this overhead is multi-task learning (MTL), where a TE is shared across multiple tasks; the encoder then needs to be run only once to generate the final embeddings for all tasks (Clark et al., 2019b).
Despite the success of MTL on closely-related tasks such as language understanding (Wang et al., 2018) or relation extraction (Chen et al., 2020; Lin et al., 2020), MTL on core NLP tasks (e.g., tagging, parsing, labeling) whose decoders are very distinct has not been well-studied. This work employs state-of-the-art decoders for MTL on five core tasks and thoroughly analyzes the interactions among those tasks to explore the possibility of reducing the computational overhead of TEs. Surprisingly, our experiments show that models jointly trained by MTL give lower accuracy than models trained individually, which contradicts findings from previous work. In fact, models jointly trained on all five tasks perform the worst among all combinations (Section 3).
These experimental results urge us to figure out why MTL on core tasks with a shared TE leads to worse performance than its single-task counterparts. Our exploration begins by detecting the essential heads for each task: we force the TE to use as few attention heads as possible while maintaining accuracy similar to a fully-utilized encoder. Our experiments reveal that all five tasks rely on almost the same set of attention heads. Hence, they compete for those heads during MTL, blurring the features extracted for individual tasks. Thus, we propose the Stem Cell Hypothesis, likening these talented attention heads to stem cells that cannot be fine-tuned for multiple tasks that are very distinct (Section 4).
To validate this hypothesis, many parameter-free probes are designed to observe how every attention head is updated when trained individually or jointly. Intriguingly, we find that heads not fine-tuned for any task can still give remarkably high performance in predicting certain linguistic structures, confirming the existence of stem cells that are inherently more talented; this is consistent with previous work stating that TEs carry a good amount of syntactic and semantic knowledge (Tenney et al., 2019; Liu et al., 2019a; Jawahar et al., 2019; Hewitt and Manning, 2019). After single-task learning, probing results typically improve along with the task performance, illustrating that the stem cells develop into more task-specific experts. In contrast, MTL often degrades both probing and task performance, supporting our hypothesis that attention heads lose expertise when exposed to multiple teaching signals that may conflict with one another (Section 5).
The Stem Cell Hypothesis is proposed to shed light on a possible direction for MTL research using TEs, which comprise enormous numbers of parameters, by wisely assigning attention heads to downstream tasks. Although most of the analysis in this study is based on BERT, we also provide extensive experimental results and visualizations for other recent TEs, including RoBERTa (Liu et al., 2019c), ELECTRA (Clark et al., 2020), and DeBERTa (He et al., 2020), in §A.4 to further demonstrate the generality of our hypothesis. To the best of our knowledge, this is the first comprehensive analysis of attention heads for MTL on these core tasks using novel parameter-free probing methods. 1

Related Work
A small portion of our work overlaps with multi-task learning. MTL with pre-trained transformers in NLP (Wang et al., 2018; Clark et al., 2019b; Liu et al., 2019b; Kondratyuk and Straka, 2019; Chen et al., 2020; Lin et al., 2020) has been widely studied. Most work focuses on neural architecture design to encourage beneficial message passing across tasks. Our MTL framework adopts a conventional architecture and applies batch-sampling (Wang et al., 2019) and loss-balancing tricks.
Most of our work falls into the analysis of BERT, especially from a linguistic view. Since BERT was introduced, studies on explaining why it works have never stopped. The most related studies are those examining the linguistic structures learned by BERT. Among them, Tenney et al. (2019) and Liu et al. (2019a) showed that part-of-speech tags, syntactic chunks, and roles can be discovered from BERT embeddings. Using a supervised probe, Hewitt and Manning (2019) successfully discovered full dependency parse trees. The encoded dependency structure is also supported by Jawahar et al. (2019) via probes on embeddings. Apart from these parameterized probes, parameter-free approaches (Clark et al., 2019a; Wu et al., 2020), which are closely related to our probing methods, also confirm the existence of rich linguistic knowledge in BERT.
What remains unclear is the impact of fine-tuning on TEs. Using supervised probes, Peters et al. (2019) claim that fine-tuning adapts BERT embeddings to downstream tasks, which is later challenged by Hewitt and Liang (2019) since a supervised probe can itself encode knowledge. Zhao and Bethard (2020) then propose a methodology to test such encoding of a linguistic phenomenon by comparing probing performance before and after fine-tuning. Our probing methods align with these unsupervised probes while focusing more on explaining the impact of multi-task learning.

1 All our resources, including source code and models, are publicly available at https://github.com/emorynlp/stem-cell-hypothesis.

Multi-Task Learning
Our goal for MTL is to build a joint model sharing the same encoder but using a distinct decoder for each task, which outperforms its single-task counterparts while being faster and more memory-efficient. Our model adopts hard parameter sharing (Caruana, 1993) such that all decoders take the same hidden states generated by the shared encoder as input and make task-specific predictions in parallel.
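The hard-parameter-sharing setup above can be sketched as follows; the class and decoder names are purely illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedEncoder:
    """Stands in for the shared transformer encoder: one forward
    pass produces hidden states consumed by every decoder."""
    def __init__(self, dim=8):
        self.W = rng.standard_normal((dim, dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)

class JointModel:
    """Hard parameter sharing: a single encoder, one decoder per task."""
    def __init__(self, encoder, decoders):
        self.encoder, self.decoders = encoder, decoders

    def __call__(self, x):
        h = self.encoder(x)  # run the shared encoder only once
        return {task: dec(h) for task, dec in self.decoders.items()}
```

Each decoder here is any callable mapping hidden states to task-specific predictions; in the paper these are the five task decoders of Section 2.2.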

Shared Encoder
For the main experiments, BERT (Devlin et al., 2019) is used as the shared encoder, although our approach can be adapted to any transformer encoder (§A.4). Every token gets split into subtokens by BERT; the average of the last layer's hidden states of those subtokens is then used as the final embedding of the token. Additionally, word dropout is applied for generalization by replacing random subtokens with [MASK] during training.
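A minimal numpy sketch of the subtoken averaging and word dropout described above; the function names, mask_id, and the dropout rate p are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def token_embeddings(subtoken_states, word_to_subtokens):
    """Average the last-layer hidden states of each word's subtokens
    to obtain one final embedding per word."""
    return np.stack([subtoken_states[idx].mean(axis=0)
                     for idx in word_to_subtokens])

def word_dropout(subtoken_ids, mask_id, p=0.1):
    """Replace random subtokens with the [MASK] id during training."""
    ids = np.array(subtoken_ids)
    drop = rng.random(len(ids)) < p
    ids[drop] = mask_id
    return ids
```

`word_to_subtokens` maps each word to the indices of its subtokens in the encoder output, e.g. `[[0], [1, 2], [3, 4]]` for a 3-word sentence with 5 subtokens.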

Task-Specific Decoders
Five tasks are experimented with: part-of-speech tagging (POS), named entity recognition (NER), dependency parsing (DEP), constituency parsing (CON), and semantic role labeling (SRL). For each task, a state-of-the-art decoder is adopted (except for POS) to provide a modern benchmark for MTL on these tasks, and simplified to build an efficient model. For DEP, the final embedding of [CLS] from BERT is used to represent the root node.

Table 1: Performance of single-task learning (main diagonal, highlighted in gray), multi-task learning on all 5 tasks (MTL-5), and multi-task learning on every pair of tasks (non-diagonal cells; e.g., the DEP row in the NER column is the DEP result of the joint model of DEP and NER). See also Table 12 for similar results of other TEs.

Data and Loss Balancing
During multi-task training, batches from different tasks are shuffled together and randomly sampled to optimize the shared encoder and the corresponding decoder. Following Wang et al. (2019), a task is sampled with probability proportional to its dataset size raised to the power of 0.8. To balance the losses across tasks, a running average of every task's loss is monitored and the loss is updated as follows:

L̂_t = L_t / L̃_t

where L_t is the current loss of task t, L̃_t is the running average of the most recent 5 losses of t, and L̂_t is the updated loss of t. This balancing method normalizes the loss of each task to the same magnitude and was shown to prevent MTL from being biased toward specific tasks in our preliminary experiments.
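The sampling and balancing scheme can be sketched as below; this assumes the update rule L̂_t = L_t / L̃_t (a reconstruction, since the equation is garbled in this excerpt), and the class and function names are hypothetical:

```python
import numpy as np
from collections import deque

def sampling_probs(dataset_sizes, alpha=0.8):
    """P(task t) proportional to |D_t|^alpha (Wang et al., 2019)."""
    w = np.array(dataset_sizes, dtype=float) ** alpha
    return w / w.sum()

class LossBalancer:
    """Normalize each task's loss by the running average of its most
    recent 5 losses, so every task's loss has magnitude ~1."""
    def __init__(self, tasks, window=5):
        self.history = {t: deque(maxlen=window) for t in tasks}

    def update(self, task, loss):
        self.history[task].append(loss)
        running_avg = sum(self.history[task]) / len(self.history[task])
        return loss / running_avg
```

With `alpha=0.8`, large datasets are sampled more often but sub-linearly, so small tasks are not starved.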

MTL Experiments
Our models are evaluated on OntoNotes 5 (Weischedel et al., 2013) using the data split suggested by Pradhan et al. (2013). Table 1 shows the performance of all models under the following evaluation metrics (POS: accuracy; NER: span-level labeled F1; DEP: labeled attachment score; CON: constituent-level labeled F1; SRL: micro-averaged F1 over (predicate, argument, label) triples). Every model is trained 3 times, and the average score and standard deviation on the test set are reported. For DEP, the gold trees from CON are converted into Stanford dependencies v3.3.0 (de Marneffe and Manning, 2008). Detailed descriptions of the experimental settings are provided in Appendix A.2.

Single-task learning models are first trained and then compared to the MTL model trained on all 5 tasks (MTL-5). Interestingly, MTL-5 is outperformed by its single-task counterparts on all tasks. Due to the high complexity of MTL-5, it is hard to tell which combinations of tasks introduce negative transfer. Thus, we conduct MTL on every pair of tasks to observe whether any task combination yields a positive result (non-diagonal cells in Table 1). Among the 10 pairwise task combinations, none derives a win-win situation. NER results are generally improved with MTL although results on the other tasks are degraded, implying that NER takes advantage of the other tasks by hurting their performance. SRL also benefits from CON, although the reverse does not hold. Results for other recent TEs reveal similar patterns, as shown in Appendix A.4.

Pruning Analysis
To answer why MTL leads to suboptimal results in Section 3, pruning strategies are applied to BERT such that only the attention heads absolutely necessary for the best performance are kept for every task. This allows us to see whether there exists a common set of heads that multiple tasks want to claim and train solely for their own objectives, which can cause conflicts when those heads are shared across all tasks.

Table 2: Results of single-task learning (STL), STL with static pruning (STL-SP), STL with dynamic pruning (STL-DP), and multi-task learning on the 5 tasks with dynamic pruning (MTL-DP). PS/S: processed samples per second for speed comparison. The STL Performance column is equivalent to the main diagonal in Table 1. See also Table 13 for similar results of other TEs.

Pruning based on L0 Regularization
Unfortunately, these binary variables z = {z_j : ∀ j ∈ [1, ℓ]} (ℓ: total # of heads) are discrete and non-differentiable, so they cannot be directly learned with gradient-based optimization. To allow for efficient continuous optimization, each z_j is relaxed into a random variable drawn independently from a continuous distribution. Specifically, the relaxed z is re-parameterized by the inverse of its cumulative density function (CDF), G_α(u), and sampled as follows, where α is a learnable parameter of the inverse CDF, U is the uniform distribution over the interval [0, 1], and u = {u_j : ∀ j ∈ [1, ℓ]} denotes iid samples from it:

u_j ∼ U(0, 1),  z_j = G_{α_j}(u_j)

The Hard Concrete distribution (Louizos et al., 2018) is then chosen for z, which gives the following differentiable form of G_α(u), where β is a temperature and (l, r) defines the interval that g_α(u) is stretched into (l < 0, r > 1):

g_α(u) = σ((log u − log(1 − u) + log α) / β),  G_α(u) = min(1, max(0, (r − l) · g_α(u) + l))

By sampling u and applying the Monte Carlo approximation, the learnable L0-objective is obtained in closed form and jointly optimized with the task-specific loss (or the balanced MTL loss):

L = L_task + λ · Σ_{j=1..ℓ} σ(log α_j − β · log(−l / r))    (1)
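A numpy sketch of the Hard Concrete gate and its closed-form L0 penalty, following the standard formulation of Louizos et al. (2018); the temperature β = 2/3 and stretch interval (l, r) = (−0.1, 1.1) are common defaults, not necessarily the values used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_concrete_sample(log_alpha, beta=2/3, l=-0.1, r=1.1):
    """Sample relaxed gates z via the inverse CDF of the Hard Concrete
    distribution: a stretched, clamped binary Concrete variable."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (r - l) + l, 0.0, 1.0)  # stretch, then clamp to [0, 1]

def expected_l0(log_alpha, beta=2/3, l=-0.1, r=1.1):
    """Closed-form P(gate != 0) per head; summing gives the
    differentiable L0 penalty of Equation (1)."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-l / r))))
```

In training, `log_alpha` would be one learnable scalar per attention head; heads whose expected gate drops to 0 are pruned.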

Pruning Strategies
Two types of pruning strategies, static and dynamic, are applied for the attention head analysis. Static Pruning We refer to the conventional two-stage train-then-prune approach as static pruning (SP) since it fine-tunes the encoder first and then freezes the decoder for pruning (Voita et al., 2019).
Dynamic Pruning Since SP requires twice the effort to obtain a pruned model, we propose a new method that fine-tunes and prunes simultaneously. This strategy is referred to as dynamic pruning (DP) since the decoder dynamically adapts to the encoder being pruned during training, as opposed to SP, which freezes the decoder instead. DP is found to be more effective in our experiments. All pruning models are trained for 3 runs with different random seeds, and the best checkpoints by development-set scores are kept. Once trained, E_{u∼U(0,1)}[z] ∈ (0, 1) is used as a measure of how much each head is utilized. Table 2 shows the single-task learning (STL) results of SP and DP on the 5 tasks. Our DP strategy consistently outperforms SP, showing higher accuracy on all tasks and pruning significantly more heads except for CON. Compared to the STL models without pruning, the STL-DP models perform on par or slightly better for POS/NER/SRL thanks to the L0 regularization, yet use ≈50% fewer heads. Comparing the STL-DP models across tasks, SRL requires more heads than DEP and CON, which in turn require more than POS. This aligns with the intuition behind their difficulty levels: semantic > syntactic > lexical relations. On the other hand, NER requires more heads than CON because of the world knowledge it needs to capture from the data, which is more scattered across heads.

Pruning Experiments
Finally, DP is applied to MTL-5 (MTL-DP), which shows slightly higher accuracy than MTL-5 in Table 1 (except for NER) while pruning 50% of the heads. This might imply that all tasks want to claim a similar set of heads even though about half of the heads are underutilized during MTL training.

Pruning Visualization
To visualize the utilization of heads across runs, the utilization rate z^(r)_{j,t} of the j'th head in the r'th run for task t is encoded into the RGB channels:

RGB_{j,t} = (1 − z^(1)_{j,t}, 1 − z^(2)_{j,t}, 1 − z^(3)_{j,t})

For instance, an RGB of (0, 0, 0) is black, indicating that the head is 100% utilized in all 3 runs. Based on this scheme, the head utilizations of all STL-DP models as well as the MTL-DP model are plotted in Figures 1a ∼ 1f. To depict the overlaps of utilized heads across tasks, the z's are averaged over all 5 STL-DP models and all 3 runs, then plotted as a grayscale heatmap (Figure 1g):

gray_j = 1 − (1 / 15) · Σ_t Σ_r z^(r)_{j,t}

As shown in Table 3, the head utilization per task is quite similar across different runs, especially for the syntactic/semantic tasks DEP/CON/SRL. The head utilization of POS appears more random because POS is simple enough that high performance can be achieved by a small set of heads. This consistency across runs is an essential prerequisite for the following analyses.

Table 3: Adjusted R-squared of 3-run head utilization rates using the third run as the dependent variable (main diagonal highlighted in gray) and Pearson correlation coefficient of averaged head utilization rates between each pair of models (non-diagonal cells).
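These color encodings can be sketched as follows; the grayscale averaging is an assumed reconstruction of the scheme described above, and the function names are ours:

```python
import numpy as np

def utilization_rgb(z_runs):
    """Encode the 3-run utilization rates of one head as an RGB triple:
    channel r = 1 - z^(r), so (0, 0, 0) means fully utilized in all runs."""
    assert len(z_runs) == 3
    return tuple(1.0 - z for z in z_runs)

def overlap_gray(z_all):
    """Grayscale overlap for one head: 1 minus the mean utilization over
    all tasks and runs, so darker cells mark heads shared by more tasks."""
    return 1.0 - float(np.mean(z_all))
```

`z_all` would hold, for one head, a tasks-by-runs matrix of utilization rates (5 × 3 here).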

Consistent head utilization across tasks
In contrast to Jawahar et al. (2019) and Tenney et al. (2019), our findings suggest that the middle layers also provide rich surface and semantic features, which aligns with Liu et al. (2019a), who showed that both POS and chunking tasks perform best when heads from the middle layers are utilized.

Consistent head utilization by STL and MTL
Figures 1f and 1g illustrate almost identical utilization patterns, implying that the MTL-DP model re-uses a set of heads very similar to those used by the STL-DP models. According to Vaswani et al. (2017), the representation capacity of every head is limited by the design of multi-head attention. Since (1) a similar set of heads is claimed across multiple tasks and (2) the limited representation capacity of individual heads confines each to only a few tasks, forcing those heads to serve all tasks in MTL leads to worse results. Given this analogy, we propose the following hypothesis: There exists a subset of attention heads in a transformer, called "stem cells", that are commonly used by many tasks and cannot be jointly trained for multiple tasks that are very different in nature.
We refer to this claim as the Stem Cell Hypothesis and seek to test it through the probing analysis.

Probing Analysis
This paper hypothesizes the existence of stem cells, which cannot be trained to create embeddings adequate for multiple tasks that are not very similar. This section provides empirical evidence for this hypothesis by probing what role each attention head plays once fine-tuned for an end task.

Probing Methods
Previous studies on probing transformer encoders have focused on layer-level analysis limited to supervised probing (Zhao and Bethard, 2020). This section instead introduces head-level probes to analyze the impact of fine-tuning on every individual head. Since developing supervised probes for hundreds of heads requires extensive resources, parameter-free probing methods are used in this study.
Attention Probes Attention between two words often matches a certain linguistic relation, which gives a good indicator of the knowledge encoded in a head. Our decoders for DEP and SRL learn relationships between head/dependent words and predicate/argument words respectively, which can directly benefit from these attentions. Thus, the attention matrix of each head is used as the probe of that head. Following Clark et al. (2019a), an undirected edge is created between each word and its most-attended word. First, the subtoken-subtoken attention matrix is converted into a word-word matrix by averaging the attention probabilities of each multi-subtoken word. The arg max of each row r of the attention matrix, denoted g_r, is then computed and evaluated per task. For DEP, the directions of the gold arcs are removed and compared against the predicted arcs as follows (h|d: the index of a head|dependent word, (h, d): an undirected arc from the gold tree, n: # of arcs):

acc_DEP = (1 / n) · Σ_{(h,d)} 1[g_d = h ∨ g_h = d]

For SRL, we design a new probing method that evaluates how strongly each word in an argument span attends to the head word of its predicate (p: the index of a predicate head word, T_p: the word indices in the span of p's argument, m: # of predicate-argument pairs):

acc_SRL = (1 / m) · Σ_{(p,T_p)} (1 / |T_p|) · Σ_{i ∈ T_p} 1[g_i = p]

Only the head words of predicates are used for this analysis, which affects verb-particle constructions (e.g., only throw is used for throw away). Moreover, not all words in an argument span are necessarily important to the meaning of its predicate. We will explore these aspects in the future.
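The attention-probe computation might be sketched as follows; the function names are ours, and the word-level conversion averages over each word's subtokens on both axes, as described above:

```python
import numpy as np

def word_attention(sub_attn, word_to_subtokens):
    """Collapse a subtoken-subtoken attention matrix into a word-word
    matrix by averaging over each word's subtokens on rows and columns."""
    rows = np.stack([sub_attn[idx].mean(axis=0) for idx in word_to_subtokens])
    return np.stack([rows[:, idx].mean(axis=1)
                     for idx in word_to_subtokens]).T

def dep_probe_accuracy(attn, gold_arcs):
    """Undirected match: gold arc (h, d) counts as correct if d's
    most-attended word is h, or h's most-attended word is d."""
    g = attn.argmax(axis=1)  # most-attended word per row
    hits = sum((g[d] == h) or (g[h] == d) for h, d in gold_arcs)
    return hits / len(gold_arcs)
```

The SRL variant would be analogous, counting how many words in each argument span most-attend the predicate head word.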
Attended-Value Probes POS, NER, and CON can be viewed as tasks that find and label spans in a sentence: a span is a single word for POS and a sequence of consecutive words for NER and CON, where CON spans may overlap with one another. For these tasks, we present another new probing method, depicted in Algorithm 1, that predicts the label of each span from its representation. The attended-value matrix H ∈ R^{n×d} is created by multiplying the attention matrix A ∈ R^{n×n} with the value matrix V ∈ R^{n×d} (Section 4.1), such that H = AV (n: sentence length, d: embedding size, abbreviated from d_k). S is the set of gold spans, m is the total number of labels, (b, e, ℓ) denotes the indices of the beginning word, the ending word, and the label respectively, and cossim is a cosine similarity function with broadcasting enabled. With Algorithm 1, the centroid of each label is obtained through PseudoCluster and then used to predict the labels of all spans. Note that for CON, only constituents at height 3 (right above the POS level) are used for this analysis. We experimented with constituents at higher levels, which did not correlate well with model performance as the spans became longer and noisier. We plan to design another probing method for deeper analysis of CON.
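A sketch of the attended-value probe under the assumption that a span is represented by the average of its rows in H (Algorithm 1's exact span representation is not shown in this excerpt); all function names are ours:

```python
import numpy as np

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True)
                      * np.linalg.norm(b, axis=-1, keepdims=True).T)

def pseudo_cluster(H, spans):
    """One centroid per label: average the attended-value representations
    of every gold span (b..e inclusive) carrying that label."""
    labels = sorted({l for _, _, l in spans})
    cents = np.stack([np.mean([H[b:e + 1].mean(axis=0)
                               for b, e, l in spans if l == lab], axis=0)
                      for lab in labels])
    return labels, cents

def probe_accuracy(H, spans, labels, cents):
    """Predict each span's label as its nearest centroid by cosine."""
    reps = np.stack([H[b:e + 1].mean(axis=0) for b, e, _ in spans])
    pred = np.array(labels)[cosine(reps, cents).argmax(axis=1)]
    gold = [l for _, _, l in spans]
    return float(np.mean(pred == gold))
```

Here H plays the role of the attended-value matrix AV of one head; higher accuracy means that head's values separate the labels well.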

Probing Experiments
Probing experiments are conducted on all attention heads of the pre-trained BERT (Devlin et al., 2019) and of the fine-tuned models trained by single-task learning (STL; diagonal in Table 1), pairwise multi-task learning (MTL; other cells in Table 1), and 5-task multi-task learning (MTL-5; last column in Table 1), using the two probing methods, attention probes and attended-value probes (Section 5.1). For each model, the head with the highest probing accuracy among the 144 heads (12 heads per layer, 12 layers) is selected per label. Since every model is trained 3 times with different random seeds for better generalization (Section 3.4), 3 heads are selected per label, whose scores are averaged to give the final probing score for that label. The full probing results for all labels are described in Appendix A.3. Two important observations arise from these experiments. (1) Even without fine-tuning, certain heads perform remarkably well on particular labels, confirming the existence of stem cells (Sec. 5.2.1).
(2) Most heads show higher performance once fine-tuned; nonetheless, MTL does not always enhance them for all tasks. In fact, the MTL models show improvements on only a few labels (Sec. 5.2.2), while the MTL-5 models show no benefit for most labels. 2

Pluripotent Stem Cells
The probing results of the pre-trained attention heads of BERT (before fine-tuning) are visualized to verify the existence and pluripotency of stem cells. These results show very high accuracies for many labels, confirming the existence of stem cells; as they reside in the same pre-trained model, their pluripotency is implied. Specifically, the number of probing tasks is 203 (POS: 49, NER: 19, SRL: 67, DEP: 45, CON: 23, as shown in Appendix A.3.3), which by itself exceeds the number of heads (144) in BERT-base. Not all heads provide task-specific knowledge, as shown in our pruning experiments (Section 4), so the number of utilized heads is even smaller. As a result, some heads must play multiple roles across different tasks.
Dependency Parsing For DEP, the probing results of the best-performing heads with respect to their layers are plotted for all labels in Figure 2, some of which are even comparable to supervised results. The best-performing head of BERT finds the ROOT of a sentence with 96.25% accuracy without any supervision, demonstrating its grasp of the concept. Furthermore, the identification of ROOT happens mostly at an early stage of inference, i.e., in layers 2 and 3. This finding may conflict with the idea of syntactic features being learned in the middle layers (Jawahar et al., 2019); it takes the argument from Tenney et al. (2019) a step further, suggesting that syntax can be encoded in the early layers of TEs.
Semantic Role Labeling As shown in Figure 3, probing shows promising results on many semantic roles. Specifically, numbered arguments (ARG0-4) are recognized in layers 5 to 7, while modifiers are identified in layers 8 to 10 with > 80% accuracies, including ARGM-COM (comitative). Unlike DEP, where most labels are learned within the first 7 layers, SRL requires 7+ layers, such that no role reaches its peak before layer 5. This implies that semantic roles take more effort to learn than syntactic dependencies.

Stem Cells Specialization
Though stem cells are pluripotent, they develop into specialized cells in STL and lose their specialties in MTL, according to the following comparisons of best-performing heads across the BERT, STL, and MTL models. Figure 4 compares the heads of the STL model against the other models; the y-axis shows the probing results of the model on the x-axis minus those of the STL model. For labels (sorted by frequency), a negative score for BERT implies that STL performs better on that label than BERT (without fine-tuning), whereas a negative score for the other models (e.g., NER, DEP) implies that the joint model performs worse than the STL model on that label. For POS, MTL degrades performance on most labels compared to STL. Even without fine-tuning, the pre-trained BERT model performs very well on punctuation labels, which is expected. The performance on WP$ (possessive wh-pronoun) is significantly improved with NER, DEP, and SRL, as a possessive wh-pronoun (e.g., whose) often follows a name or is used in a relative clause that plays an important role in DEP and SRL.

Named Entity Recognition For NER, BERT detects PERCENT, MONEY, LAW, LANGUAGE, NORP (national|religious|political groups), and PRODUCT with over 90% probing accuracy, probably due to the rich set of those entities present in the pre-training data. Although most joint models degrade probing results for nearly every entity type, POS and DEP improve upon more entity types than the other tasks (Figure 5), which is consistent with the results in Table 1.

Constituency Parsing As shown in Figure 6, POS improves the largest number of constituent types but also causes the largest drop among the MTL models on the most frequent type, NP (noun phrase). It contributes to related constituent types such as RB (adverb) in ADVP (RB phrase), WRB (wh-adverb) in WHADVP (WRB phrase), and UH (interjection) in INTJ (UH phrase). Its dramatic decrease on NP might be due to the internal lexical complexity of NP. Regarding NER, its boost on ADVP can be attributed to temporal entities (TIME, DATE) nested within ADVP, such as (ADVP (NP one year) ago), where one year ago is a DATE entity. As for PP (preposition phrase), it usually follows the production rule PP → IN/TO + NP, where the NP is often a named entity (e.g., (PP (TO to) (NP Mary))). Regarding DEP, it mainly improves wh-phrases like WHNP and WHADVP, which correspond to the nsubj and advmod dependency relations, respectively. Regarding SRL, it slightly improves NML (nominal modifiers) and FRAG (fragment), which may be ascribed to the strength of span-based SRL not requiring constituency structures for decoding. Note that in these probing analyses, we selected the best-performing heads from each model independently; without the pruning objective (Equation 1), their locations are no longer regular, possibly due to knowledge transfer between stem cells and non-stem cells.
Thus, knowledge transfer from stem cells to non-stem cells becomes much easier when models are free to use as many heads as they want. In fact, when fine-tuned without the pruning objective, many stem-cell attention heads transfer their knowledge to non-stem-cell heads that become specialized. The phenomenon that certain layers achieve the best performance on certain tasks has been frequently observed in previous work (Tenney et al., 2019; Liu et al., 2019a; Jawahar et al., 2019) and in this work (Section 5.2.2). Given that the stem cells of BERT are mostly in the middle layers (Section 4.4), we believe that the best-performing layers or heads in lower or higher layers are the result of transfer learning on stem cells. In reality, a stem cell also moves from its original area (e.g., bone marrow) to another (e.g., the bone surface) to specialize.

Conclusion
This study analyzes interference among the 5 core tasks by highlighting naturally talented attention heads, whose importance turns out to be invariant across many downstream tasks. The Stem Cell Hypothesis states that these talented heads are like stem cells: they can develop into experts but not all-rounders. Our hypothesis is validated by several novel parameter-free probes, revealing the interfered representations of stem cells. We will adapt this work to more tasks and languages for broader generality in the future.

A Appendix
A.1 Corpus Statistics

A.2 Hyper-Parameter Configuration
The hyper-parameters used in our models are described in Table 6.

A.3 Full Probing Results

Compared to STL, a trend similar to POS and NER can be observed in that MTL improves only certain relations, as shown in Figure 7. Among the 4 tasks, POS improves the fewest tags in terms of both overall accuracy and probing accuracy. It improves dobj (direct object) and expl (expletive), possibly because its decoder needs to assign a verb tag to the ROOT verb and EX (existential there) to "there", enhancing the representations of these two. Regarding NER, it mainly improves modifiers that modify the nouns comprising named entities. In the case of CON, modifiers and complement arguments are improved, most of which usually reside in NP (noun phrase) or VP (verb phrase) constituents, placing upper bounds on the distance of dependencies. As regards SRL, it improves subjects and clausal relations. In comparison to STL, illustrated in Figure 8, POS and CON mainly improve ARG0-3 (agent, patient, instrument, benefactive, attribute, and starting point) and some modifiers including ARGM-TMP (temporal), ARGM-CAU (cause), ARGM-PRD (secondary predication), ARGM-EXT (extent), ARGM-PNC (purpose), and ARGM-REC (reciprocals). Both POS and CON reveal syntactic functions, which appear to coordinate attention on semantic roles in a similar way. Regarding NER and DEP, they improve arguments that include referents or pronouns (R-ARG1, R-ARG0) and modifiers (ARGM-NEG (negation), ARGM-CAU, ARGM-PRD, ARGM-EXT), possibly because the biaffine decoders they employ enhance the heads analogously.

A.4 Results for Other Transformers
We also applied our pruning and probing methods to 3 recent transformer encoders, RoBERTa (Liu et al., 2019c), ELECTRA (Clark et al., 2020), and DeBERTa (He et al., 2020), to further demonstrate the generality of our hypothesis. For all of them, we use the base-size version, which has 12 layers with 144 attention heads in total. Although we did not tune hyper-parameters specifically for any of them and re-used the same hyper-parameters as BERT, their results turned out to be as interesting as the BERT results. Their STL and MTL results are shown in Table 12. Unsurprisingly, MTL-5 is outperformed by its single-task counterparts for all tasks and all transformer encoders, echoing the dilemma behind transformer-based MTL.
Their pruning results are shown in Table 13. Although the results could be further tuned for each transformer encoder, our DP strategy is still able to prune roughly 50% of the heads while keeping comparable performance.
Their head utilization is visualized in Figures 9, 11, and 13. Similar patterns can be observed across the transformer encoders, supporting our claim that the MTL-DP model re-uses a set of heads very similar to those used by the STL-DP models.
Their probing results are illustrated in Figures 10, 12, and 14, which also align with our findings. Specifically, the DEP probing results on these transformer encoders are already very high even without fine-tuning on actual dependency treebanks (Figures 10f, 12f, and 14f).

Table 13: Results of single-task learning (STL), STL with static pruning (STL-SP), and multi-task learning on the 5 tasks with/without dynamic pruning (MTL/MTL-DP). The STL Performance column is equivalent to the main diagonal in Table 12.