Task Compass: Scaling Multi-task Pre-training with Task Prefix

Leveraging task-aware annotated data as supervised signals to assist self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle this challenge, we propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but also be used as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. Code is available at https://github.com/cooelf/CompassMTL


Introduction
Recent years have witnessed a growing interest in leveraging a unified pre-trained language model (PrLM) to solve a wide range of natural language processing tasks (Tay et al., 2022; Chowdhery et al., 2022; Xie et al., 2022; Zhang et al., 2022). The pre-training recipe of a PrLM is shifting from self-supervised learning (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020) to multi-task learning (MTL) with a mixture of standard self-supervised tasks and various supervised tasks, which takes advantage of learning from both large-scale unlabeled corpora and high-quality human-labeled datasets (Raffel et al., 2019; Aribandi et al., 2021).
Figure 1: Input-output view. We append a task prefix to each data sequence to capture common patterns from the dataset and require the model to predict some randomly masked prefixes to capture task differences.
Benefiting from supervision from related tasks, MTL approaches reduce the cost of curating deep learning models for an individual task and provide a shared representation that is generally applicable to a range of tasks (Wu et al., 2020b).
In the research line of multi-task learning for PrLMs, a typical solution is to cast all tasks into a text-to-text format and utilize an encoder-decoder PrLM such as T5 to predict the target sequences (Raffel et al., 2019; Aribandi et al., 2021). Despite the extensive efforts on leveraging supervised tasks to strengthen PrLMs, the latest trend is extreme scaling of the number of tasks, with little attention paid to the relationships between tasks (Sanh et al., 2021; Wei et al., 2021). Aribandi et al. (2021) investigated co-training transfer effects among task families and empirically found that tasks in different families may have adverse effects on each other; e.g., summarization tasks generally seem to hurt performance on other task families such as dialogue systems (Mehri et al., 2020), natural language inference (Bowman et al., 2015), and commonsense reasoning (Lourie et al., 2021).
When the number of tasks scales up, the training of PrLMs becomes more vulnerable to negative transfer due to the severe inconsistency of domain and data distribution between tasks (Wu et al., 2020b; Padmakumar et al., 2022). As one of the key concepts underlying MTL, task relationships potentially provide a useful basis for employing PrLMs in a more effective and interpretable way.
To handle the issue of negative transfer during multi-task learning, early studies have taken task relationships into account by employing a dual-process model architecture composed of a shared encoder and task-specific layers. The two parts are designed to integrate the common features of all the learning tasks and to explore the task relationships in a predefined manner, respectively (Zheng et al., 2019; Liu et al., 2019a; Bai et al., 2020; Ma et al., 2021). However, these methods require additional modifications to the model architecture and increase model complexity and computation cost. Therefore, they are suboptimal for application to PrLMs in terms of generality and computational bottlenecks.
All the considerations above motivate our goal to investigate simple yet effective ways to measure task relationships without additional cost while keeping the generality of PrLMs. In this work, we propose a prefix-guided multi-task learning framework (CompassMTL) to explore the mutual effects between tasks (Figure 1) and improve model performance with complementary tasks. Targeting natural language understanding (NLU) tasks, we employ a discriminative PrLM as the backbone model and train the model on 40 tasks. Experimental results show that our model achieves human-parity performance on commonsense reasoning tasks. We further probe into the task relationships entailed in the task prefix representations, finding that the measured relationships highly correlate with task-to-task transfer performance and also provide a useful reference for optimizing the PrLM on a target task with its complementary tasks during MTL, i.e., fewer tasks with better performance.
In summary, our contributions are threefold: 1) A unified discriminative multi-task PrLM for NLU tasks, which will be released as a strong counterpart to the dominant T5-based encoder-decoder PrLMs trained with MTL.
2) A probing tool that uses task prefixes to explore task relationships in large-scale MTL. We observe that the task relationships reflected by the prefixes correlate with transfer learning performance, and they help our model achieve better results with complementary tasks.
3) State-of-the-art results on a variety of NLU tasks, especially human-parity benchmark performance on commonsense reasoning leaderboards, i.e., HellaSwag and αNLI.
Background and Related Work

Self-supervised Pre-training

PrLMs are commonly pre-trained on large-scale corpora and then fine-tuned on individual tasks. One of the most widely used pre-training tasks is masked language modeling (MLM), which first masks out some tokens from the input sentences and then trains the model to predict them from the remaining tokens. Derivatives of MLM include permuted language modeling in XLNet (Yang et al., 2019) and sequence-to-sequence MLM in MASS (Song et al., 2019) and T5 (Raffel et al., 2019). Beyond general-purpose pre-training, domain-adaptive pre-training and task-adaptive pre-training have attracted attention in recent studies.

1) Domain-adaptive Pre-training. To incorporate specific in-domain knowledge, domain-aware pre-training directly post-trains the original PrLMs on a domain-specific corpus. Popular models have been proposed in the dialogue domain (Whang et al., 2020; Wu et al., 2020a), as well as in the medical and science domains (Lee et al., 2020; Beltagy et al., 2019; Huang et al., 2019a; Yu et al., 2022).
2) Task-adaptive Pre-training. The goal of task-adaptive pre-training is to capture task-specific skills by devising dedicated pre-training tasks. Popular application scenarios include logical reasoning and dialogue-related tasks (Kumar et al., 2020; Gu et al., 2020; Zhang and Zhao, 2021; Li et al., 2021). For example, Whang et al. (2021) proposed various utterance manipulation strategies, including utterance insertion, deletion, and retrieval, to maintain dialog coherence.

Multi-task Learning for PrLMs
The MTL we are concerned with in the field of PrLMs is partially related to the studies of task-adaptive pre-training discussed above. The major difference is that the PrLMs in MTL are fed with human-annotated datasets instead of automatically constructed ones for self-supervised tasks. Figure 2 gives an overview of the paradigms of MTL PrLMs. Existing methods in this research line mostly vary in model architectures and training stages. For example, MT-DNN (Liu et al., 2019a) applied multi-task learning to train a shared model on all the target datasets in the fine-tuning stage, with several task-aware output modules to adapt the shared representations to each task. Recent studies, such as ExT5 (Aribandi et al., 2021), T0 (Sanh et al., 2021), and FLAN (Wei et al., 2021), commonly apply an encoder-decoder architecture, convert a variety of tasks into the same text-to-text format, and train those tasks jointly (Figure 2-a).
We argue that they are not the optimal solution considering the model complexity and the gap between original and transformed task formats, especially for natural language understanding tasks that are discriminative in nature, e.g., classification and multiple choice. Indeed, there are studies (McCann et al., 2018; Keskar et al., 2019; Li et al., 2020; Khashabi et al., 2020) that transform traditional tasks into other formats like reading comprehension or question answering and achieve better results than prior methods. These studies motivate us to explore superior model backbones and data formats, especially for application to NLU tasks.

Modeling Task Relationships in MTL
Modeling task relationships is a classic topic in deep learning studies. Bingel and Søgaard (2017) studied which task relations yield gains in traditional natural language processing and investigated when and why MTL works in sequence labeling tasks such as chunking, sentence compression, POS tagging, and keyphrase detection. Wu et al. (2020b) found that task data alignment can significantly affect the performance of MTL and proposed an architecture with a shared module for all tasks and a separate output module for each task. Since these methods require additional modifications of the model architecture, they are suboptimal for employment in PrLMs, considering computational bottlenecks and generality when the number of tasks scales up. In the era of pre-trained models, Geva et al. (2021) analyzed behavior transfer in PrLMs between related jointly-trained tasks such as QA and summarization and thus provided evidence for the extrapolation of skills as a consequence of multi-task training. ExT5 (Aribandi et al., 2021) evaluated the transfer performance among task families in a multi-task co-training setup and observed that negative transfer is common, especially when training across task families. Although recent studies insert prompts to describe the task requirements in the data sequences (Liu et al., 2021; Su et al., 2022; Qin et al., 2021; Vu et al., 2022), it is still not clear whether the prompts help mitigate negative transfer or whether the prompts necessarily capture task relationships. In this work, we find that using task prefixes along with MLM-based prefix prediction effectively indicates task relationships and helps MTL achieve better performance with fewer datasets.

Task Format
According to prior studies (McCann et al., 2018; Keskar et al., 2019; Khashabi et al., 2020), the benchmark results on a task can be affected dramatically by training a model on different formats of the same dataset. In contrast to converting all tasks in a text-to-text manner, we choose to model our tasks in a multiple-choice-like format to minimize the format transformation for NLU tasks. Our transformation aims to ensure that each example in a task has a fixed number k of candidate options during the multi-task training stage. The original pair-wise input texts are regarded as the context and question in the view of the multiple-choice problem. If only one text is given, the question is kept empty. For the outliers, the data is processed as follows (examples are provided in Appendix A.1):
1) If the number of candidate options > k, the redundant options are randomly discarded;
2) If the number of candidate options < k, "N/A" placeholder options are added;
3) If the ground truth is a list, randomly select a correct option from the gold list and randomly sample k − 1 negative options from the held-out set, excluding the remaining items in the gold list;
4) If the ground truth is a list and there is an empty choice, construct the truth option manually, e.g., "there is no violation"; the negative options are constructed in the same way as in 3).
As a result, each training example is formed as a sequence like {[Prefix]: context, question, option}, where [Prefix] indicates the task name in natural language, such as [hellaswag], prepended to each data example.
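To make the conversion concrete, the sketch below shows one possible way to normalize an example to exactly k options and render the prefixed sequence. It is illustrative only: field and function names are our own, and rule 4 is omitted since it is handled manually.

```python
import random

def normalize_options(options, gold, k, held_out_pool):
    """Return exactly k options and the index of the correct one.

    options: candidate options of the original example
    gold: the ground-truth option (string) or a list of correct options
    held_out_pool: options from other examples, used to sample negatives
    """
    if isinstance(gold, list):
        # Rule 3: pick one correct option, then sample k-1 negatives that are
        # not in the gold list.
        answer = random.choice(gold)
        negatives = [o for o in held_out_pool if o not in gold]
        options = [answer] + random.sample(negatives, k - 1)
    else:
        answer = gold
        others = [o for o in options if o != answer]
        if len(others) > k - 1:
            # Rule 1: randomly discard redundant options.
            others = random.sample(others, k - 1)
        while len(others) < k - 1:
            # Rule 2: pad with "N/A" placeholder options.
            others.append("N/A")
        options = [answer] + others
    random.shuffle(options)
    return options, options.index(answer)

def render(prefix, context, question, option):
    # Single-text tasks: the question is kept empty.
    question = question or ""
    return f"[{prefix}]: {context} {question} {option}"
```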

CompassMTL
Our model is encoder-only and based on the DeBERTa architecture (He et al., 2021). The model is trained with both the supervised task objective and the standard self-supervised denoising objective, as described below.
Suppose that we have a dataset $D = \{(y_i, c_i, q_i, r)\}_{i=1}^{N}$, where $c_i$ represents the context, $q_i$ represents the question, $r = \{r_1, \ldots, r_k\}$ denotes the set of answer options, and $y_i$ is the label. $N$ is the number of training examples. Each data example is formed as a prefixed input sequence $x_i$ following the format described above. The goal is to learn a discriminator $g(\cdot, \cdot)$ from $D$. For the supervised task, the loss function is $\mathcal{L}_{mtl} = -\sum_{i=1}^{N}\sum_{j=1}^{k} y_{i,j} \log\big(g(c_i, q_i \cdot r_j)\big)$, where $y_{i,j} = 1$ if $r_j$ is the labeled answer for the $i$-th example and $0$ otherwise, and $\cdot$ denotes concatenation. At the inference phase, given any new context $c_i$, question $q_i$, and options $r$, we use the discriminator to calculate the matching score $g(c_i, q_i \cdot r_j)$ for each option; the option with the highest score is chosen as the answer for the $i$-th example.
Let $\tilde{x}_i$ denote the masked sequence in which a certain proportion of tokens in $x_i$ are randomly replaced with a special [MASK] symbol. Feeding $\tilde{x}_i$ to the model in parallel with $x_i$, the self-supervised denoising objective is computed in the way of MLM: $\mathcal{L}_{mlm} = -\sum_{i=1}^{N}\sum_{j \in M} \log p\big(t_{i,j} \mid \tilde{x}_i\big)$, where $t_{i,j}$ is the $j$-th token in $x_i$ and $M$ denotes the index set of masked tokens for which the loss is computed. To encourage the model to learn from both supervised and self-supervised signals, we combine $\mathcal{L}_{mtl}$ and $\mathcal{L}_{mlm}$ during training: $\mathcal{L} = \mathcal{L}_{mtl} + \lambda \mathcal{L}_{mlm}$, where $\lambda$ is a hyper-parameter that balances the weight of the two training objectives.
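For illustration, a simplified sketch of the combined objective is given below. It assumes the model already produces per-option matching scores and per-token MLM logits; it is not the exact training code.

```python
import torch.nn.functional as F

def compass_loss(option_scores, labels, mlm_logits, mlm_labels, lam=0.1):
    """Joint objective L = L_mtl + lambda * L_mlm (a simplified sketch).

    option_scores: (batch, k) matching scores g(c_i, q_i . r_j) over the k options
    labels:        (batch,) index of the correct option y_i
    mlm_logits:    (batch, seq_len, vocab) predictions for the masked sequence
    mlm_labels:    (batch, seq_len) original token ids, -100 at unmasked positions
    """
    # Supervised multi-task objective: cross-entropy over the k candidate options.
    l_mtl = F.cross_entropy(option_scores, labels)
    # Self-supervised denoising objective: MLM over masked positions only.
    # Task prefixes are also masked with a certain probability, so the model
    # must recover the task identity from the input data.
    l_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    return l_mtl + lam * l_mlm
```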
Compared with traditional MTL methods, CompassMTL is data-centric, without any modification of the model architecture (Figure 2-b). It can be regarded as an efficient implementation of the traditional MTL method composed of a shared representation module and multiple task-aware modules. Since the data from the same dataset share the same task prefix, the prefix is expected to reflect the common patterns of the dataset, which works on a similar operational principle to the shared representation module. During training with our self-supervised objective, task prefixes are randomly masked with a certain probability. The model is then required to distinguish the task prefixes and predict the right prefix according to the input data. Therefore, the task differences are also necessarily captured.

Task Relationship Exploration
Regarding the task prefixes as the compass to navigate the task relationships, it is possible to use our framework to analyze the relevance between tasks. After the model is pre-trained with MTL, we fetch the prefix embeddings from the model embedding layer and calculate the Pearson correlation between each task pair with min-max normalization. Assuming that we have n tasks, the process results in n × n correlation scores that indicate the task relationships. For a target task, we can then directly rank the top-related tasks according to the correlation scores and use those complementary tasks for MTL before fine-tuning on the target task (Figure 2-c).
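A minimal sketch of this probing step is shown below, assuming the task-prefix embeddings have already been fetched from the embedding layer as a NumPy array (function and variable names are illustrative):

```python
import numpy as np

def task_relationship_matrix(prefix_embeddings):
    """prefix_embeddings: (n_tasks, hidden) array of task-prefix embeddings
    taken from the embedding layer of the MTL-pre-trained model."""
    n = prefix_embeddings.shape[0]
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Pearson correlation between two prefix embedding vectors.
            scores[i, j] = np.corrcoef(prefix_embeddings[i], prefix_embeddings[j])[0, 1]
    # Min-max normalization over the n x n correlation scores.
    return (scores - scores.min()) / (scores.max() - scores.min())

def top_related_tasks(scores, task_names, target, top_k=5):
    """Rank the tasks most related to the target task by correlation score."""
    idx = task_names.index(target)
    ranked = sorted(
        (t for t in task_names if t != target),
        key=lambda t: scores[idx, task_names.index(t)],
        reverse=True,
    )
    return ranked[:top_k]
```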

Implementations
Our model is implemented using PyTorch and based on the Transformers library (Wolf et al., 2019). To save computation, we initialize our model with the released checkpoints of DeBERTa-V3-Large, and the hyper-parameter setting generally follows DeBERTa (He et al., 2021). ExDeBERTa, a baseline in our main results, is our imitation of ExT5-style (Aribandi et al., 2021) MTL training with a DeBERTa backbone trained on the 40 datasets using both the self-supervised denoising objective and the supervised task objective, after which the model is transferred to each individual task. "w/ Tailor" denotes multi-task training with related datasets (14-subset) according to our discovery in Section 5.3. Experiments are run on 8x32GB Tesla A100 GPUs. The maximum input sequence length is 512. Similar to Lourie et al. (2021), the implementation of CompassMTL includes two procedures: we first conduct multi-task pre-training on all the datasets and then continue to train on each target dataset alone to verify the performance. For multi-task pre-training, we use a peak learning rate of 6e-6 with a warm-up rate of 0.1. We run up to 6 epochs using a batch size of 128. The masking ratio of MLM is 0.25, and λ is set to 0.1. To avoid large-scale datasets dominating the pre-training, the training data is randomly sampled with a limit of 10k on the maximum dataset size, following Raffel et al. (2019). For fine-tuning experiments, the initial learning rate is selected from {3e-6, 6e-6, 8e-5} with a warm-up rate of 0.1. The batch size is selected from {16, 32}. The maximum number of epochs is chosen from {6, 10}. More fine-tuning details are available in Appendix A.2.
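One simple reading of this size-limited sampling is sketched below; the helper and variable names are hypothetical, and the exact mixing strategy in the released code may differ.

```python
import random

MAX_EXAMPLES_PER_DATASET = 10_000  # cap to keep large datasets from dominating

def build_mtl_mixture(datasets):
    """datasets: dict mapping task prefix -> list of preprocessed examples."""
    mixture = []
    for prefix, examples in datasets.items():
        if len(examples) > MAX_EXAMPLES_PER_DATASET:
            # Randomly subsample oversized datasets down to the cap.
            examples = random.sample(examples, MAX_EXAMPLES_PER_DATASET)
        mixture.extend(examples)
    random.shuffle(mixture)
    return mixture
```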

Main Results
Our main results are reported on the Rainbow and LexGLUE benchmark datasets for comparison with public methods. As shown in Tables 1-2, CompassMTL outperforms the related public models in general. Specifically, our encoder-only models yield better performance than the T5-based encoder-decoder models at similar model sizes.

Relationship Probing
Figure 4 illustrates the heatmap of task relationships probed by prefix embeddings. We see that the datasets inside the same task family (e.g., GLUE and Rainbow) correlate highly with each other. The LexGLUE tasks are less related to the other tasks because their texts are mainly legal descriptions. In addition, the correlation scores also accord with the common practice of data augmentation. For example, the NLI datasets (MNLI, QNLI, RTE) are closely related, and it is helpful to initialize parameters from an MNLI model when fine-tuning RTE (Liu et al., 2019b; Qu et al., 2020).

We are interested in whether the probed relationship scores coordinate with the model performance transferred between tasks. We first obtain transfer accuracy between tasks in a dual-task training setup (Aribandi et al., 2021). Assume that we have 13 source tasks from the GLUE and Rainbow families and 5 target tasks (αNLI, HellaSwag, MRPC, QNLI, and RTE). We first train individual models using the mixture of training sets from each pair of source and target tasks, and then evaluate each model on the validation set of the target dataset. As a result, we have 5 × 13 transfer results. For each target dataset, we calculate the Pearson correlation between the relationship scores and the transfer accuracy over the source datasets. In Table 4, we find that the relationship scores are positively correlated with the transfer performance. The results indicate the potential to find related tasks by the relationship scores. In other words, the relationship scores essentially reflect task relationships.

Task relationships may also be reflected by shallow token distributions, such as vocabulary overlap or sentence length. To investigate whether our relationship probing can be replaced by comparing token distributions, we further analyze the correlation between the similarity of token distributions and the dual-task transfer accuracy. For sentence length, we first calculate the absolute difference in average length between the source and target datasets and then negate it (intuitively, the smaller the length difference, the closer the relationship). The vocabulary overlap of the source and target datasets is also computed for comparison. These similarities show only weak correlations with the transfer accuracy (on 2/5 and 3/5 datasets, respectively, in Table 4). These results are less consistent than our probing method, which indicates that our method mines more complex patterns relating to task relationships.
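For reference, the two shallow baselines can be computed as in the following sketch, assuming each dataset is available as a list of tokenized sentences (names are illustrative):

```python
import numpy as np

def length_similarity(source_tokens, target_tokens):
    """Negative absolute difference of average sentence length:
    a smaller length gap yields a larger (less negative) similarity."""
    src_len = np.mean([len(s) for s in source_tokens])
    tgt_len = np.mean([len(s) for s in target_tokens])
    return -abs(src_len - tgt_len)

def vocab_overlap(source_tokens, target_tokens):
    """Fraction of the target vocabulary also observed in the source dataset."""
    src_vocab = {tok for sent in source_tokens for tok in sent}
    tgt_vocab = {tok for sent in target_tokens for tok in sent}
    return len(src_vocab & tgt_vocab) / len(tgt_vocab)
```

Each baseline score can then be correlated (Pearson) with the dual-task transfer accuracy in the same way as the prefix-based relationship scores.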

Complementary Transfer
To inspect whether using more datasets always leads to better performance and whether using only the most related datasets can lead to competitive results, we conduct a complementary transfer analysis by selecting a group of datasets to train an MTL model and fine-tuning the model on target datasets. Four choices of dataset mixture are compared: 1) 40-fullset: the same as our basic setting of CompassMTL in this work; 2) Top-5: the top-5 ranked datasets based on our probed relationship scores; 3) Family: the datasets belonging to the same family as the target dataset, i.e., 6 datasets for Rainbow tasks and 7 datasets for GLUE tasks; 4) 14-subset: the mixture of the Rainbow and GLUE datasets. Table 5 presents the comparison results. We observe that the top-5 ranked variant yields comparable, or even better, results than the others, which indicates that training with more datasets does not always bring benefits. The results also indicate that small-scale datasets (e.g., MRPC and RTE), which have relatively high average correlation scores with the other datasets, are more likely to benefit from the complementary transfer. As the number of tasks scales up, the performance (family → 14-subset) may improve as more related tasks are involved in training.

Human-parity on Commonsense Reasoning Leaderboards

On the HellaSwag and αNLI leaderboards, our models establish new state-of-the-art results and reach human-parity performance.

Beyond The Unified Format
To verify whether our model can be employed for tasks that cannot be transformed into our unified format, we evaluate the effectiveness of CompassMTL on the typical reading comprehension datasets SQuAD v1.1/2.0 (Rajpurkar et al., 2016, 2018) and the named entity recognition (NER) dataset CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), which represent the extractive question answering and sequence labeling task formats, respectively. We first replicate the baselines for fine-tuning the QA and NER tasks using the Transformers toolkit. For comparison, we initialize the baseline parameters with our model weights to see whether CompassMTL improves over the baselines. Results in Table 6 show that our model is generally effective across formats. The results also indicate that CompassMTL can serve as a strong off-the-shelf representation encoder that is applicable to new tasks without needing to be pre-trained again.
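For instance, the released checkpoint can in principle be loaded as an ordinary encoder and paired with standard task heads from the Transformers library; the checkpoint path below is a placeholder, not the actual released identifier.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    AutoModelForTokenClassification,
)

CKPT = "path/to/compassmtl-checkpoint"  # placeholder for the released CompassMTL weights

tokenizer = AutoTokenizer.from_pretrained(CKPT)
# Extractive QA (SQuAD-style): encoder weights are reused, the QA head is newly initialized.
qa_model = AutoModelForQuestionAnswering.from_pretrained(CKPT)
# Sequence labeling (CoNLL-2003 NER): same encoder, with a token-classification head on top.
ner_model = AutoModelForTokenClassification.from_pretrained(CKPT, num_labels=9)
```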

Implementation Using The T5 Backbone
Although our method is implemented with an encoder-only backbone to compete on NLU tasks, it should be generally applicable to other kinds of PrLMs, such as the encoder-decoder T5. To verify this, we employ the pre-trained T5-base model (Raffel et al., 2019) as the backbone. We use the Rainbow datasets for MTL and convert the data into the text-to-text format following the standard processing for T5 training, with task prefixes inserted before each data sequence. The baselines are the single-task T5 trained on each individual task and UNICORN (Lourie et al., 2021) trained on the Rainbow datasets. Results in Table 8 verify that our method is generally effective.
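A possible conversion of one example into this text-to-text format could look like the sketch below; the exact verbalization of the target is our assumption rather than the released preprocessing.

```python
def to_text_to_text(prefix, context, question, options, answer_index):
    """Render one multiple-choice example as a (source, target) text pair for T5."""
    # Source: task prefix followed by the context, question, and enumerated options.
    choices = " ".join(f"({i + 1}) {opt}" for i, opt in enumerate(options))
    source = f"[{prefix}]: {context} {question} {choices}"
    # Target: the text of the correct option, as in standard T5-style training.
    target = options[answer_index]
    return source, target
```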

Conclusions
This work presents a task prefix guided multi-task method that makes use of task prefixes to explore the mutual effects between tasks and improve model performance with complementary tasks. Our released model can not only serve as a strong foundation backbone for a wide range of NLU tasks but also be used as a probing tool for analyzing task relationships. Our model shows generalizable advances over tasks in diverse formats and establishes human-parity results on commonsense reasoning tasks. Based on our pre-trained model, we find that the prefixes necessarily reflect task relationships, which correlate with transfer learning performance between tasks and suggest directions for data augmentation with complementary tasks. In summary, our work has the following prospects for future studies: 1) Collaborative multi-task learning of PrLMs. The recipe of using task prefixes in conjunction with prefix prediction in MLM training has proven effective for large-scale MTL pre-training.
2) Suggestive choice for data augmentation.
The task relationships probed by the prefix embeddings have proven informative in finding complementary tasks. Using complementary tasks helps obtain better performance on a target task, especially for small-scale task datasets.
3) Guidance for skill-aware model evaluation.
The discovery of task relationships may help identify redundant datasets that assess similar model skills. Recently, there has been a trend toward evaluating the comprehensive skills of deep learning models with a large number of datasets (Srivastava et al., 2022); the selection of distinctive datasets can be guided by our relationship-discovery criteria to avoid evaluation redundancy and save computation.
Limitations. We acknowledge that the major limitation of this work is that our model may not readily apply to new tasks. It is based on the common assumption of MTL that the set of tasks is known at training time. Adaptation to new tasks could be future work.

Table 2 :
Results on LexGLUE test sets. The baseline results, except ours in the last column, are from Chalkidis et al. (2021). Since the LexGLUE tasks except CaseHold are multi-label classification problems, the ExDeBERTa model is not directly applicable to those tasks without extra task-specific fine-tuning; thus, those results are not reported. "w/ Tailor" denotes multi-task training with the seven datasets in the same LexGLUE family.

Table 3 presents the ablation results of removing $\mathcal{L}_{mtl}$ and $\mathcal{L}_{mlm}$, respectively. The results suggest that both supervised and self-supervised tasks contribute to the overall model performance, and the supervised task is more beneficial than the self-supervised task in our study. Further, to inspect the role of the task prefixes, we

Table 3 :
Ablation study of the training objectives and task prefixes. We calculate the average accuracy scores on the development sets of all 40 datasets.

Table 4 :
Pearson correlation between each relationship measure and the transfer accuracy on each target dataset (columns: RTE, MRPC, QNLI, HellaSwag, αNLI, Avg.).

Table 8 :
Results on the Rainbow validation sets by using T5-base as the backbone model.