GradTS: A Gradient-Based Automatic Auxiliary Task Selection Method Based on Transformer Networks

A key problem in multi-task learning (MTL) research is how to select high-quality auxiliary tasks automatically. This paper presents GradTS, an automatic auxiliary task selection method based on gradient calculation in Transformer-based models. Compared to AUTOSEM, a strong baseline method, GradTS improves the performance of MT-DNN with a bert-base-cased backend model by 0.33% to 17.93% on 8 natural language understanding (NLU) tasks in the GLUE benchmark. GradTS is also time-saving since (1) its gradient calculations are based on single-task experiments and (2) the gradients are re-used without additional experiments when the candidate task set changes. On the 8 GLUE classification tasks, for example, GradTS costs on average 21.32% less time than AUTOSEM with comparable GPU consumption. Further, we show the robustness of GradTS across various task settings and model selections, e.g. mixed objectives among candidate tasks. The efficiency and efficacy of GradTS in these case studies illustrate its general applicability in MTL research without requiring manual task filtering or costly parameter tuning.


Introduction
MTL (Caruana, 1997) is widely used in NLU research to improve the performance of machine learning (ML) models by enlarging the training data size with datapoints related to the primary tasks. However, its efficacy is largely affected by the selection of auxiliary tasks. The auxiliary task selection problem is addressed mainly under two settings. The first setting treats each task as a whole. For example, Bingel and Søgaard (2017) assess task relatedness by exhaustive experiments on all task pairs. Nonetheless, high pairwise task correlations are often not decisive features for choosing auxiliary tasks. Glover and Hokamp (2019) train a policy for task selection through counterfactual estimation, but their learned policy brings improvements to only one out of nine tasks on the GLUE benchmark (Wang et al., 2019). The second setting subsamples training instances from auxiliary tasks, e.g. with Bayesian optimization (Ruder and Plank, 2017), but these methods are time- and resource-consuming due to their reliance on multi-task experiments involving all the candidate tasks. AUTOSEM (Guo et al., 2019) combines the two settings into one method, selecting candidate tasks with Thompson sampling and deciding the ratio with which to draw training instances from the selected tasks via a Gaussian Process. Despite the higher quality of the auxiliary task sets it generates, AUTOSEM is still costly, similar to Ruder and Plank (2017).

† Work done when interning at the Minds, Machines, and Society Lab at Dartmouth College.
To design a better-performing and less costly auxiliary task selection method, we take advantage of the characteristics of Transformer networks (Vaswani et al., 2017). Prior research reveals that in a Transformer-based model, each attention head attends to specialized linguistic features (Clark et al., 2019; Voita et al., 2019; Mareček and Rosa, 2019; Lin et al., 2019; Vig and Belinkov, 2019; Kovaleva et al., 2019). Since important linguistic features strongly correlate with the goals of tasks, we further hypothesize that a good auxiliary task shares key linguistic features with the primary task. Thus, we address the auxiliary task selection problem by maximizing the overlap of important heads in a Transformer-based model between primary and auxiliary tasks. As Michel et al. (2019) claim, the importance of attention heads to a task can be approximated by the absolute gradients accumulated at each head. We design our auxiliary task selection method, GradTS, accordingly, by ranking the importance of attention heads for each individual task and modeling the correlation between each pair of tasks with their head rankings. By greedily selecting the tasks most closely related to the primary task, GradTS constructs auxiliary task sets through trial experiments (GradTS-trial). GradTS also enables task subsampling to further optimize auxiliary task sets. To achieve this goal, we design another setting of GradTS (GradTS-fg) that first assesses the correlations between the primary task and each training instance in an auxiliary task selected by GradTS-trial and then filters the training instances via thresholding.
We assess the strength of GradTS via MTL evaluations on 8 GLUE classification tasks. We use AUTOSEM and AUTOSEM-p1¹ as our baselines since AUTOSEM is among the most advanced auxiliary task selection methods in the NLP field and because it features both task selection and task subsampling. For consistency, we use the bert-base-cased model as the backend model of GradTS, AUTOSEM, and the MTL framework. Results show that GradTS-trial produces better auxiliary task sets than AUTOSEM-p1 on all 8 GLUE tasks while costing on average 6.73% less time. In experiments with task subsampling, GradTS-fg again shows superior strength to AUTOSEM on all 8 tasks while costing 21.32% less time. These results strongly support the efficacy and efficiency of GradTS.
In addition to the main experiments, we compare GradTS to multiple intuitive auxiliary task selections to show its high performance. We also conduct case studies to show that GradTS is effective and robust on difficult tasks and on candidate task sets with mixed objectives. These findings reflect the general applicability of GradTS in various task settings. In comparison, the auxiliary task sets produced by AUTOSEM and AUTOSEM-p1 are often not optimal in these complicated scenarios. Further, GradTS reuses the head rankings when the candidate task set grows larger, which makes it even more time- and resource-efficient than existing methods.
The contributions of this paper are three-fold:
• we propose GradTS, an automatic auxiliary task selection method based on gradient calculation in pre-trained Transformer-based models;
• we illustrate the efficacy and efficiency of GradTS through comprehensive MTL evaluations; and
• we show, through case studies, the superior capability and robustness of GradTS on complicated candidate task settings compared to both AUTOSEM and auxiliary task selections based on human intuition.

¹ We refer to the AUTOSEM method without task subsampling as AUTOSEM-p1.

Tasks and Datasets

(Tjong Kim Sang and Buchholz, 2000). The official data split of all these datasets is applied. Additionally, we introduce the MELD and Dyadic-MELD datasets (Poria et al., 2019) to verify the applicability of GradTS to tasks that are difficult for its backend model. While these two tasks are multimodal emotion recognition tasks, we use only the textual data in our experiments. The MELD and Dyadic-MELD datasets are annotated with 7 emotion labels. The bert-base-cased model achieves F-1 scores below 50% on both tasks, lower than its performance on most GLUE classification tasks.
Details of the datasets are displayed in Table 1. We evaluate both accuracy and F-1 scores for MRPC and QQP, accuracy for QNLI, RTE, SST-2, MNLI, and WNLI, Matthews correlation coefficient (MCC) for CoLA, Pearson's and Spearman's correlation coefficients for STSB, and F-1 scores for the POS, NER, SC, MELD, and Dyadic-MELD tasks.

Methodology
We design GradTS based on the hypothesis that better auxiliary tasks share more important linguistic features with the primary task. Since each attention head in a Transformer-based model functions similarly to a standalone feature extractor over a specialized set of features, we approximate the important feature set of each task by the heads contributing the most to the task. As the key feature sets are task-specific, GradTS does not require multi-task experiments to rank auxiliary tasks given a primary task. This makes GradTS a time- and resource-economical method, especially when the set of candidate auxiliary tasks is large or growing.
GradTS consists of three successive modules responsible for (1) ranking attention heads for a task based on their contributions, (2) ranking auxiliary tasks based on inter-task correlations, and (3) finalizing the auxiliary task sets, respectively.

Attention Head Ranking Module
We estimate the importance of attention heads to a task using the absolute gradients accumulated at each head, following Michel et al. (2019). Specifically, we achieve the goal in four steps: (1) We fine-tune a pre-trained Transformer-based model on a task. (2) We repeat the fine-tuning step on the training set of the task with the fine-tuned model, without updating parameters, to get gradients of the model. (3) We sum up the absolute gradients accumulated at each attention head during the last fine-tuning step. (4) We layer-wise normalize the accumulated gradients and scale the gradients to the range [0, 1] globally to represent the importance of each head for the given task.
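Steps (3) and (4) can be sketched as follows, assuming the absolute gradients have already been accumulated into a layers × heads matrix. The normalization scheme shown (dividing by each layer's gradient sum) is one plausible reading of "layer-wise normalize", and the function and variable names are our own:

```python
import numpy as np

def head_importance(accum_abs_grads: np.ndarray) -> np.ndarray:
    """Turn accumulated absolute gradients (layers x heads) into
    head-importance scores: layer-wise normalization, then a global
    min-max scaling to [0, 1], following steps (3) and (4)."""
    # Layer-wise normalization: each layer's gradients sum to 1, so
    # layers with systematically larger gradients do not dominate.
    per_layer = accum_abs_grads / accum_abs_grads.sum(axis=1, keepdims=True)
    # Global scaling to the range [0, 1] across all heads.
    lo, hi = per_layer.min(), per_layer.max()
    return (per_layer - lo) / (hi - lo)

# Toy example: a 2-layer, 4-head model.
grads = np.array([[3.0, 1.0, 2.0, 2.0],
                  [8.0, 4.0, 2.0, 2.0]])
scores = head_importance(grads)
```

The resulting matrix is what the auxiliary task ranking module below compares across tasks.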
In practice, we use the pre-trained bert-base-cased model as the backend of GradTS, and we fine-tune the model for three epochs before starting to accumulate gradients on each head. This fine-tuning stage is designed to avoid large gradients on unimportant heads when the model is exposed to a downstream task for the first time.

Auxiliary Task Ranking Module
Given a primary task, we rank each candidate auxiliary task by the correlation between its head ranking matrix and that of the primary task. As Puth et al. (2015) suggest, we use Kendall's rank correlation coefficient (Kendall's τ) since, based on our observations, the importance scores of heads seldom result in a tie. We visualize the head importance matrix of the bert-base-cased model on MRPC and the task correlations for the 8 GLUE classification tasks in Figures 3 and 4 in the Appendix.
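As an illustration, Kendall's τ between two tasks' head-importance matrices can be computed as below. This is a minimal pure-Python sketch of τ-a (which assumes no ties, matching our observation above); in practice a library routine such as scipy.stats.kendalltau would serve, and all names here are our own:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score sequences
    (assumes no ties, as head-importance scores rarely tie)."""
    assert len(xs) == len(ys)
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

def task_correlation(head_scores_a, head_scores_b):
    """Correlation between two tasks' head-importance matrices,
    flattened so that every attention head is one observation."""
    flat_a = [s for layer in head_scores_a for s in layer]
    flat_b = [s for layer in head_scores_b for s in layer]
    return kendall_tau(flat_a, flat_b)

# Identical head rankings give tau = 1; reversed rankings give tau = -1.
a = [[0.9, 0.1], [0.5, 0.3]]
b = [[0.8, 0.2], [0.6, 0.4]]
```

For each primary task, ranking the candidates by this score yields the auxiliary task ranking used in the selection module.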
While the rankings of auxiliary tasks produced by GradTS are intuitive in some cases, e.g. the three natural language inference (NLI) tasks are good auxiliary tasks for each other, the correlation scores between many seemingly unrelated tasks, e.g. WNLI and CoLA, are also high. This reveals the difficulty of manually designing auxiliary task sets, since the factors affecting the appropriateness of auxiliary tasks are multi-faceted, e.g. text lengths and label distributions. As a result, designing automatic methods for selecting auxiliary tasks makes up a crucial part of MTL research, especially at a time when candidate auxiliary task sets are rapidly growing both in number and in complexity.

Auxiliary Task Selection Module
After obtaining the rankings of candidate auxiliary tasks for each primary task, we finalize the auxiliary task selection process through trial experiments. We also study the potential of GradTS to subsample the selected auxiliary tasks. Our experiments show that with one additional fine-tuning pass of its backend model on the individual tasks, GradTS produces subsampled auxiliary training sets higher in quality than the task-level selections.
We introduce the two settings of GradTS that select tasks from the task correlations as follows: [Task-level Trial-based] We select auxiliary tasks greedily under this setting. Starting from the task most closely correlated with the primary task, we keep adding tasks to the auxiliary task set and run MTL evaluations on the primary task and all the chosen auxiliary tasks. GradTS stops adding new tasks when the evaluation score starts to decrease on the validation set we leave out for parameter tuning, and finalizes the auxiliary task set with the tasks chosen at the previous step.
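The greedy trial-based loop can be sketched as follows. Here run_mtl_eval is a hypothetical callback standing in for a full MTL training run that returns the primary task's validation score, and treating the single-task score as the initial baseline is our assumption:

```python
def select_auxiliary_tasks(primary, candidates, correlations, run_mtl_eval):
    """Greedy trial-based selection (a GradTS-trial sketch).

    `correlations[(primary, t)]` holds Kendall's tau between the head
    rankings of `primary` and candidate `t`; `run_mtl_eval(primary, aux)`
    trains MTL on the given task set and returns the validation score
    on the primary task (hypothetical callback)."""
    # Rank candidates by their correlation with the primary task.
    ranked = sorted(candidates, key=lambda t: correlations[(primary, t)],
                    reverse=True)
    chosen = []
    best_score = run_mtl_eval(primary, chosen)  # single-task baseline
    for task in ranked:
        score = run_mtl_eval(primary, chosen + [task])
        if score < best_score:
            # Score starts to decrease: keep the previous step's set.
            break
        best_score = score
        chosen.append(task)
    return chosen
```

With a toy score table where adding a third task hurts, the loop stops after the first two candidates, mirroring the stopping rule described above.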
[Instance-level] We re-run the base model of GradTS on all the individual tasks once, with gradient calculation but no parameter updates. For each instance, we take the absolute value of its gradients on all the attention heads, layer-wise normalize the gradients, and scale the numbers to the range [0, 1]. Then we calculate and record the correlation score between the normalized gradient matrix and the head ranking matrix of each candidate auxiliary task. Last, we use a threshold to select auxiliary training instances from tasks chosen by the task-level trial-based method to form a subsampled auxiliary task set. The threshold we use in this paper, a Kendall's τ of 0.42, is tuned by experiments on the RTE, MRPC, and CoLA tasks.
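Following the description in the introduction (correlating each auxiliary training instance with the primary task), the instance-level filter reduces to a simple thresholded selection; this sketch uses our own names, and the correlation function is passed in since any rank correlation over the matrices would fit:

```python
def subsample_instances(instance_grad_matrices, primary_head_ranking,
                        correlation_fn, threshold=0.42):
    """Instance-level subsampling (a GradTS-fg sketch).

    Keep the indices of auxiliary training instances whose normalized
    gradient matrix correlates with the primary task's head-ranking
    matrix at or above the threshold (a Kendall's tau of 0.42 in the
    paper, tuned on RTE, MRPC, and CoLA)."""
    return [idx for idx, grads in enumerate(instance_grad_matrices)
            if correlation_fn(grads, primary_head_ranking) >= threshold]
```

Because the per-instance gradient matrices are computed once per task, the same filter can be re-applied for any primary task without re-running the backend model.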
We refer to GradTS with the task-level trial-based and instance-level task selection settings as GradTS-trial and GradTS-fg, respectively.

Experimental Settings
To show the strength of GradTS, we run evaluations with MT-DNN (Liu et al., 2019a) as the MTL evaluation framework on 8 classification tasks in the GLUE benchmark. The bert-base-cased model is used as the backend of MT-DNN and all the auxiliary task selection methods. For tasks whose input contains multiple sentences, we concatenate the sentences together with a [SEP] token in between. We use the Huggingface (Wolf et al., 2020) implementation of BERT (Devlin et al., 2019) and other pre-trained models in this paper. In each experiment, we fine-tune MT-DNN for 7 epochs with a learning rate of 5e-5 and report the highest score.⁶

⁶ We apply the same set of hyper-parameters in all the experiments for fair comparison. We also use official dataset splits to minimize randomness in all our experiments.
Figure 1: Task selection results by two AUTOSEM and two GradTS methods on 8 GLUE classification tasks. Y and X axes represent primary and auxiliary tasks, respectively. Darker color in a cell indicates that a larger portion of an auxiliary task is selected.

Auxiliary Task Selection Results
Figure 1 shows the auxiliary task sets selected by the AUTOSEM-p1, AUTOSEM, GradTS-trial, and GradTS-fg methods. Each auxiliary task is labeled as 1 (selected) or 0 (not selected) for methods under the task-level auxiliary task selection setting (AUTOSEM-p1 and GradTS-trial). The percentage of training data selected from each auxiliary task is reflected for AUTOSEM and GradTS-fg. While some common task combinations appear in the auxiliary task sets constructed by both GradTS-trial and AUTOSEM-p1, e.g. CoLA-WNLI and QNLI-MNLI, the two methods generally make very different selections. We note that GradTS-trial usually generates larger auxiliary task sets than AUTOSEM-p1 on tasks with small training data sizes, e.g. WNLI, RTE, and MRPC. Different from AUTOSEM-p1, which balances exploitation with exploration at the task selection phase, the auxiliary task ranking mechanism of GradTS-trial is in full charge of controlling the risk of selecting improper auxiliary tasks. The task selection module of GradTS-trial greedily chooses auxiliary tasks based on the task rankings, and it is thus more likely than AUTOSEM-p1 to also select auxiliary tasks that only marginally improve the performance of the primary task. There are more disagreements between the task selection ratios of GradTS-fg and AUTOSEM than between the task-level selections. For example, while WNLI is consistently discarded by AUTOSEM at its second phase due to the small size of WNLI, GradTS-fg ranks WNLI highly for three primary tasks (CoLA, MRPC, and RTE). Benefiting from its training instance ranking mechanism, which treats each record independently, GradTS-fg is robust to the higher overall impact of a few noisy instances in smaller datasets. As such, GradTS has a lower chance of underestimating the importance of small auxiliary datasets than AUTOSEM. While some auxiliary task selection results are intuitive, they are mostly beyond the scope of manual designs. For example, QQP is not chosen by either AUTOSEM or GradTS as a good auxiliary task for CoLA or MRPC, despite its large size. It is also counter-intuitive that GradTS does not select MNLI or QNLI into the auxiliary task set of WNLI, though these tasks share similar goals. Due to the gap between the automatic auxiliary task selection results and human intuitions, we assess the strength of these task selection methods via MTL evaluations and show the results in Table 2.

MTL Evaluation Results
While MTL is designed to enhance model performance, our evaluations reveal that simply using all the available auxiliary tasks without selection is not sufficient. Despite the enlarged training dataset, MTL with all the candidate auxiliary tasks brings only marginal improvements to 3 out of the 8 GLUE classification tasks. On the contrary, MTL performance is generally higher than single-task evaluation scores when an auxiliary task selection method is applied. We attribute this phenomenon to the greater discrepancies in some primary-auxiliary task combinations when auxiliary tasks are not carefully selected. These results show that while MTL provides a promising way to boost the performance of ML models, a good automatic auxiliary task selection method is necessary. Between the two task-level auxiliary task selection methods, GradTS-trial produces better auxiliary task sets than AUTOSEM-p1 for all 8 primary tasks. MTL performance with GradTS-trial also beats the single-task baseline in all the evaluations, while AUTOSEM-p1 produces low-quality auxiliary task sets on tasks whose training sets are extremely large (MNLI) or small (WNLI) compared to the other tasks. This demonstrates that GradTS-trial is more robust to the design of candidate auxiliary task sets than AUTOSEM-p1. Though the auxiliary task sets selected by AUTOSEM-p1 and GradTS-trial overlap considerably for CoLA, MRPC, and RTE, AUTOSEM draws no training instances from WNLI in its second phase, resulting in large performance gaps between AUTOSEM and GradTS-fg on these tasks. For comparison, GradTS-fg samples 59.94%, 70.98%, and 60.25% of the WNLI dataset, respectively, for CoLA, MRPC, and RTE, and achieves 3.79%, 17.93%, and 13.30% higher MTL evaluation scores than AUTOSEM on these tasks. Despite the generally higher fragility of small datasets to noisy annotations, these datasets may contain useful datapoints as auxiliary training instances and should not be completely ignored.
GradTS-fg subsamples tasks on the instance level, which is more efficient and flexible in picking highly-correlated training instances than the second phase of AUTOSEM.

Table 4: MTL evaluation results with AUTOSEM and GradTS auxiliary task selection methods on 10 classification tasks. Single-Task indicates the single-task performance of bert-base-cased and NO-SEL indicates the performance of MT-DNN, with the bert-base-cased backend, trained on all 10 tasks. D-MELD refers to Dyadic-MELD.

Running Time Analysis
As Table 3 shows, the average GPU usage is comparable for all four auxiliary task selection methods in the main experiments. All the experiments are run with a batch size of 32 on an NVIDIA RTX-8000 graphics card. Among the four methods, GradTS-trial is the most time-efficient, mainly because its task rankings are generated from single-task experiments and are fixed for all the evaluations. While GradTS-fg filters training instances based on the output of GradTS-trial, the additional time cost is only linearly correlated with the training data size of the auxiliary tasks. On average, GradTS-fg takes longer to finish than AUTOSEM-p1 but is more efficient than AUTOSEM. Since GradTS reuses the task-specific head importance matrices and the thresholds for subsampling auxiliary tasks, it becomes gradually more time-economical than AUTOSEM and AUTOSEM-p1 as the candidate task set grows. Thus, GradTS is a superior choice to AUTOSEM on large and complex task sets in terms of both efficacy and efficiency.

Discussions
GradTS is shown to be effective on 8 classification NLU tasks in our main experiments. In this section, we conduct case studies to (1) explore whether GradTS is effective on tasks that are difficult or have different training objectives, (2) validate that GradTS selects better auxiliary task sets than human intuition, and (3) justify our use of bert-base-cased as the backend model of GradTS and the MTL evaluation framework.

Task Selection with Difficult Tasks
GradTS relies on the hypothesis that the amount of gradient distributed on each attention head reflects the important linguistic features for a task. However, tasks that are difficult for a model introduce more noise into its gradient calculations and thus may have negative effects on GradTS. To study the effect of difficult tasks, we evaluate GradTS on a task set containing the 8 GLUE classification tasks and the two MELD tasks. The MELD and Dyadic-MELD tasks are difficult for the bert-base-cased model, as the single-task performance on both tasks is below 50% in F-1 score.
We note that the largest tasks in size, i.e. MNLI and QQP, are not chosen as auxiliary tasks for either MELD or Dyadic-MELD, suggesting that training data amount is not a decisive factor for auxiliary task selection. As auxiliary tasks, MELD is selected for SST-2 and CoLA, and Dyadic-MELD for SST-2 and RTE. The connection between SST-2 and the two MELD tasks is intuitive since emotional and sentiment features are interconnected, while the other selections are not as intuitive.
We show the evaluation scores in Table 4. Compared to Table 2, we note that the performance of AUTOSEM-p1 is largely harmed when MRPC, MNLI, and WNLI are set up as primary tasks, while AUTOSEM performance also suffers on QQP. On the contrary, GradTS-trial performs relatively stably, and GradTS-fg frequently produces auxiliary task sets of higher quality on the enlarged candidate task set than on the 8 GLUE classification tasks only. We attribute the strength of GradTS-fg to its ability to discard noisy training instances and mainly select datapoints contributing to the primary tasks. When MELD and Dyadic-MELD are primary tasks, MTL performance, either with or without auxiliary task selection, is generally higher than the single-task baseline. These results indicate the importance of MTL research and highlight the study of good auxiliary task selection methods, especially on tasks that are difficult under the single-task setting. We also note that while AUTOSEM-p1 is not able to generate high-quality auxiliary task sets for MELD, the successive data subsampling mechanism in AUTOSEM polishes the data selection and improves the MTL performance by 2.63 in F-1 score. Similarly, GradTS-fg generates better auxiliary task sets than GradTS-trial in all the evaluations, revealing the necessity of filtering out noisy auxiliary training instances. To conclude, while both GradTS-trial and GradTS-fg are robust to difficult tasks in the candidate task sets, GradTS-fg is, in general, the better choice in these scenarios.

Task Selection with Mixed Objectives
Most prior publications on MTL, including AUTOSEM, consider only auxiliary tasks with the same training objective as the primary task. This overly simplifies the auxiliary task selection problem and limits the scope of research on the topic. In this section, we examine the applicability of GradTS to candidate task sets with mixed objectives. As candidate tasks, we use the 8 GLUE classification tasks, a regression task (STSB), and three sequence labeling tasks (POS, NER, and SC). The auxiliary task selection results by GradTS are shown in Figure 2; they make intuitive sense in some cases, e.g. POS and SC are closely bound to CoLA, and STSB is selected as an auxiliary task for MRPC.
We assess the quality of the auxiliary task sets produced by GradTS via evaluations with MT-DNN and display the results in Table 5. Results show that the performance of GradTS does not suffer from introducing the four non-classification tasks, as the auxiliary task sets selected by GradTS in most cases lead to higher MTL performance than in Table 2. In comparison, the auxiliary task sets produced by AUTOSEM-p1 and AUTOSEM are noisier with the four newly-introduced tasks, causing noticeable performance drops on 3 and 2 GLUE classification tasks, respectively. Furthermore, while both GradTS-trial and GradTS-fg lead to higher MTL performance than not applying any auxiliary task selection method, applying AUTOSEM-p1 and AUTOSEM causes performance drops on 4 and 3 tasks, respectively. AUTOSEM-p1 and AUTOSEM even cause the MTL performance to drop below the single-task evaluation scores in 3 and 4 experiments, respectively. The results indicate that, despite the potentially increased discrepancies among tasks with various objectives, GradTS is an effective and robust auxiliary task selection method. We also note that since GradTS reuses the head ranking matrices produced in the main experiments, its additional time cost on the enlarged task set is negligible, compared to AUTOSEM, which has to be fully re-run. This further demonstrates the efficiency of GradTS, especially when the candidate task set grows larger.

Comparison to Intuitive Task Selections
We further validate the strength of GradTS by comparing the MTL performance with GradTS to that with three intuitive task selection methods based on simple dataset analysis. The three heuristics we set up for comparison choose auxiliary tasks based on (1) training data size (HEU-Size), (2) similarity in task type between the primary and auxiliary tasks (HEU-Type), and (3) similarity in average sentence length (HEU-Len). Table 7 displays the training data amount, task type, and average sentence length of the 8 tasks. For HEU-Size and HEU-Len, starting from the most appropriate auxiliary task, we keep adding tasks into the auxiliary task set greedily and report the best score. According to Table 6, while the intuitive task selections usually result in higher performance than the single-task evaluation scores (and comparable to the AUTOSEM results shown in Table 2), the GradTS methods outperform the intuitive methods on all 8 tasks. Among the three intuitive task selection methods, HEU-Type in most cases produces the auxiliary task sets of the highest quality. This demonstrates the high priority of auxiliary tasks with similar goals to the primary task at the task selection phase. While the importance of task types is reflected in the task selection results of GradTS (Figure 1) as well, GradTS is able to take other empirical clues into consideration and construct more effective auxiliary task sets. These additional clues, however, are expensive to design and cannot be directly transferred to other candidate task sets without costly adaptations if manual auxiliary task selection methods are applied. Moreover, the three simple heuristics are not always applicable when the candidate task set becomes complex, e.g. containing tasks with varying label sets or with multiple objectives. GradTS, on the contrary, has shown great capability and robustness in these complex cases with moderate time and resource cost. It is a promising method in place of expensive manual auxiliary task set design in MTL research.

Base Model Selection for GradTS
Since GradTS is built on Transformer-based models, we select its backend model from 6 common pre-trained Transformer-based models, namely bert-base-uncased, bert-base-cased, bert-large-uncased, bert-large-cased, roberta-base, and roberta-large. We set up MTL evaluations with MT-DNN on CoLA, MRPC, SST-2, WNLI, QNLI, and RTE to examine the appropriateness of these backend models for GradTS. Specifically, we assess the strength and robustness of these models by comparing the MT-DNN performance with auxiliary task sets selected by GradTS against that without auxiliary task filtering. The same backend model is used for GradTS and MT-DNN in each experiment to eliminate possible discrepancies across models.
From Table 8, we notice clear performance gaps between the cased and uncased models as the backend of GradTS. For example, on CoLA and SST-2, GradTS-trial produces worse auxiliary task sets than using all candidate tasks with a bert-base-uncased backend, while GradTS with a bert-base-cased backend improves model performance under both the GradTS-trial and GradTS-fg settings. This is intuitive since case information is crucial for grammaticality and sentiment tasks. Among the four cased backend models, RoBERTa (Liu et al., 2019b) does not trigger larger MT-DNN performance improvements than BERT of the same size, implying that larger pre-training corpora do not greatly affect the efficacy of GradTS. While the performance improvement brought by GradTS with the bert-large-cased backend is comparable to that with the bert-base-cased backend, its running time and GPU cost are over 100% higher. We thus choose bert-base-cased as the backend of GradTS to balance performance with resource cost, though potentially any cased Transformer-based model is a valid choice.

Conclusion and Future Work
This paper presented GradTS, an automatic auxiliary task selection method for MTL based on pre-trained Transformer-based models. On 8 GLUE classification tasks, GradTS produced auxiliary task sets of higher quality than AUTOSEM, a strong baseline method, with less time and resource consumption. In our case studies comparing GradTS to intuitive task selections, GradTS showed greater capability of finding more optimal auxiliary task sets than trivial heuristics based on dataset statistics. We additionally demonstrated that GradTS was both more effective and more efficient than our baseline on task sets with mixed objectives. GradTS was also shown to be robust on task sets containing tasks that are difficult for its backend model. These findings support the applicability of GradTS to a wide range of task and model settings. Future work may extend the use of GradTS to determine high-quality auxiliary datasets for the same task.

A Head Importance and Task Correlation Matrices

We show the head importance matrix of the bert-base-cased model on MRPC in Figure 3. The task correlation matrix generated by bert-base-cased on the 8 GLUE classification tasks is shown in Figure 4.

Table 9: MTL evaluation results of the three settings of GradTS on 8 GLUE classification tasks. The highest score for each task is in bold.

B Threshold-based Task Selection
Besides the two settings of GradTS (GradTS-trial and GradTS-fg) for the auxiliary task selection module, we additionally introduce a task-level threshold-based setting. Under this setting, we empirically choose a threshold with which GradTS produces the best auxiliary task sets on a small collection of tasks. Then, for each primary task, GradTS selects all the tasks having correlation scores above the threshold into the auxiliary task set. We use a Kendall's τ of 0.47 as the threshold in the evaluations, tuned via experiments on the RTE, MRPC, and CoLA tasks. We refer to GradTS with the task-level threshold-based setting as GradTS-thres. It is worth noting that though GradTS-thres is more time- and resource-economical than GradTS-trial and GradTS-fg, its performance is the lowest in most cases. We display the evaluation results of GradTS-thres, GradTS-trial, and GradTS-fg on the 8 GLUE classification tasks in Table 9 and the average time and memory cost in Table 10. Since GradTS-thres is not optimal compared to GradTS-trial and GradTS-fg, and is sometimes outperformed by AUTOSEM, we do not introduce this setting in the main body of our paper. However, it remains an interesting setting to study given its time and resource efficiency and its stronger capability than not applying any auxiliary task selection method.
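For completeness, the threshold-based setting amounts to a one-line filter over the precomputed task correlations, with no trial MTL runs at all (a sketch with hypothetical names; 0.47 is the tuned Kendall's τ threshold reported above):

```python
def select_by_threshold(primary, candidates, correlations, threshold=0.47):
    """Task-level threshold-based selection (a GradTS-thres sketch):
    keep every candidate whose Kendall's tau with the primary task's
    head ranking meets the threshold (0.47 in our evaluations)."""
    return [t for t in candidates
            if correlations[(primary, t)] >= threshold]
```

The absence of trial evaluations is exactly what makes this setting cheap, and also why its selections are less reliable than those of GradTS-trial and GradTS-fg.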