Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?

Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks. Does this new architecture suffer from task conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.


Introduction
Multi-task learning (MTL) aims to obtain superior results by training a single model over multiple related tasks (Caruana, 1997). Despite MTL's promises, training neural networks with a multi-task objective often yields worse results on a given task than task-specific models, resulting in negative transfer. Why do some MTL settings work while others suffer from negative transfer? Many attribute this behavior to optimization challenges in the MTL loss surface (Ruder, 2017) that arise from multi-task conflict: significant differences across the gradients of different tasks which may trap SGD in poor optima (Yu et al., 2020). Prior research has focused on methods to mitigate conflict between tasks during training to ensure consistent MTL improvements (Chen et al., 2018; Sener and Koltun, 2018; Yu et al., 2020; Wang et al., 2020, inter alia).
Research on conflict in MTL has largely focused on the canonical MTL setting: all tasks share an encoder which projects inputs into a shared representation space, and then task predictions are made using a task-specific head. Recently, it has become common in NLP to leverage a unified text-to-text MTL framework: all tasks are framed as sequence generation problems using a single, fully-shared language model to make predictions for each task (Figure 1; McCann et al., 2018; Raffel et al., 2020). This framing relaxes several constraints imposed in the canonical setting: text-to-text models impose almost no constraints on the output space, and leave the model to infer the task via natural language prompts. Despite this flexibility, text-to-text models, in combination with unsupervised pre-training, are strong multi-task learners (Raffel et al., 2020; Sanh et al., 2021).
In this work, we are interested in the following question: how does reframing MTL with a single text-to-text model affect negative (or positive) transfer, specifically with regard to multi-task conflict in the training objective? We explore this through the following contributions:
• We identify three main factors, or model properties, that change between canonical and text-to-text MTL models which may individually impact multi-task conflict and transfer.
• We empirically investigate the effect of each of these factors on negative transfer and multi-task conflict over two standard MTL benchmarks. We find that architectural factors affect multi-task conflict but have little effect on transfer.
• Finally, we show that while task prompts are necessary in multi-task learning to specify tasks, enriching the semantics of task descriptions with natural language can worsen negative transfer, even if it helps zero-shot capabilities.
Overall, our findings suggest that, despite having a fully-shared parameterization between tasks, text-to-text models are surprisingly not inherently superior multi-task learners to canonical multi-task architectures, and may similarly benefit from research in conflict mitigation techniques.

2 Background & Related Work

2.1 Canonical Multi-Task Learning
Suppose we have K tasks, each with a dataset D_k over X × Y_k, where X is all sequences of tokens from a fixed vocabulary and Y_k is a task-specific output space. Our goal is to learn task-specific models f_k : X → Y_k for each task while sharing information across all tasks, without any prior knowledge about how tasks are related. Canonically, this is done by training a shared encoder g_θ : X → R^d along with a task-specific head h_{ϕ_k} : R^d → Y_k, to project the input into a shared representation space before predicting into Y_k. Parameters θ are shared across all tasks, parameters ϕ_k are specific to task k, and we specify f_k = h_{ϕ_k} ∘ g_θ.
To train a canonical multi-task model, we first define task-specific loss functions ℓ_k(ŷ, y); for instance, ℓ_k may be mean-squared error for a regression task or cross-entropy for a classification task. Let L_k(Θ) represent the average task loss ℓ_k(f(x; Θ), y) over (x, y) pairs from a mini-batch of D_k, where Θ = {θ, ϕ_1, . . . , ϕ_K} for convenience. At timestep t, we update our parameters with SGD using the gradient update ∇_{ϕ_k} L_k(Θ_t) for the task-specific parameters ϕ_k, and the multi-task gradient ∇^MT_θ = Σ_k ∇_θ L_k(Θ_t) for the shared parameters θ.
Canonical architectures have long been the dominant approach to multi-task learning in deep neural networks (Caruana, 1997; Ruder, 2017). Classically, multi-task conflict occurs in the multi-task gradient update ∇^MT_θ, and negative transfer arises from difficulties in learning a proper shared encoder model g_θ.
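To make the canonical setup concrete, the following sketch shows a toy shared encoder with task-specific heads and a joint SGD step that accumulates the shared gradient across task mini-batches. It is a minimal illustration of the formulation above (in PyTorch, with invented module names, dimensions, and toy data), not the implementation used in our experiments.

```python
import torch
import torch.nn as nn

# Minimal canonical MTL sketch: a shared encoder g_theta and one head h_phi_k per task.
# The encoder is a toy bag-of-embeddings module standing in for a real T5 encoder.
class SharedEncoder(nn.Module):
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        return self.embed(token_ids).mean(dim=1)     # (batch, d) shared representation

encoder = SharedEncoder()
heads = nn.ModuleDict({
    "sst2": nn.Linear(64, 2),   # binary classification head
    "stsb": nn.Linear(64, 1),   # regression head
})
losses = {"sst2": nn.CrossEntropyLoss(), "stsb": nn.MSELoss()}
params = list(encoder.parameters()) + list(heads.parameters())
opt = torch.optim.SGD(params, lr=1e-2)

def mtl_step(batches):
    """One multi-task SGD step: the shared gradient is the sum of per-task gradients."""
    opt.zero_grad()
    for task, (x, y) in batches.items():
        rep = encoder(x)                 # shared representation g_theta(x)
        pred = heads[task](rep)          # task-specific prediction h_phi_k
        if task == "stsb":
            loss = losses[task](pred.squeeze(-1), y.float())
        else:
            loss = losses[task](pred, y)
        loss.backward()                  # accumulates the gradient on theta across tasks
    opt.step()

# Toy batches with random data, just to show the call pattern.
batches = {
    "sst2": (torch.randint(0, 1000, (8, 16)), torch.randint(0, 2, (8,))),
    "stsb": (torch.randint(0, 1000, (8, 16)), torch.rand(8) * 5),
}
mtl_step(batches)
```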

Text-to-Text Multi-Task Learning
In the text-to-text (T2T) MTL setting, we presuppose a function r_k : Y_k → X which maps the task-specific output space into natural language. Given such an r_k for all tasks, our goal is to learn functions f_k : X → X for each task k. Because the output space for each task is the same, we can make predictions with a shared decoder j_{θ_d} : R^d → X, which yields a fully shared encoder-decoder model f = j_{θ_d} ∘ g_{θ_e} : X → X, where θ = {θ_e, θ_d} are shared across all tasks and g_{θ_e} is defined as above.
Given f, how do we specify f_k? A common approach is to pre-pend a fixed natural language description of the task p_k ∈ X to the input during training and testing, i.e. f_k(x) = f(p_k ⌢ x), where ⌢ denotes concatenation. The hope of this approach is that p_k moves the input into a sufficiently distinct distribution that f learns to reliably predict task k for any input that starts with p_k. This approach has shown surprising success recently: McCann et al. (2018) frame p_k as a task-specific question, and successfully train a joint Q&A model to answer every task's questions; Khashabi et al. (2020) train a single text-to-text system on 19 distinct Q&A tasks, finding that a unified input-output format is competitive with task-specific formats and systems; most recently, T5 (Raffel et al., 2020) and T0 (Sanh et al., 2021) treat MTL as a language modeling problem, pre-pending p_k to task inputs and jointly learning all tasks by fine-tuning a pre-trained encoder-decoder language model.
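As a concrete illustration, task specification in the text-to-text setting reduces to string manipulation on both the input and output side; the prompt strings and verbalizers below are illustrative placeholders (our actual defaults are listed in Appendix B).

```python
# Hypothetical prompts p_k and output verbalizers r_k for two tasks.
PROMPTS = {
    "sst2": "sst2 sentence: ",
    "mnli": "mnli premise: ",
}
VERBALIZERS = {
    "sst2": {0: "negative", 1: "positive"},
    "mnli": {0: "entailment", 1: "neutral", 2: "contradiction"},
}

def to_text_to_text(task, x, y):
    """Map a (task, input, label) triple to the (source, target) strings fed to f."""
    source = PROMPTS[task] + x       # f_k(x) = f(p_k ⌢ x)
    target = VERBALIZERS[task][y]    # r_k maps Y_k into natural language
    return source, target

print(to_text_to_text("sst2", "a gripping, beautifully shot film.", 1))
# ('sst2 sentence: a gripping, beautifully shot film.', 'positive')
```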

Negative Transfer & Multi-Task Conflict
The differentiating factor between multi-task and single-task learning is how the shared parameters are trained: ∇^MT_θ pulls θ towards a region of the optimization surface that might not be explored by a single-task objective. Hypothetically, this allows information to transfer across tasks, yielding a stronger model (Caruana, 1997); in practice, multi-task objectives often yield negative transfer, underperforming single-task systems. Recent work has attributed negative transfer to conflict between the directions and magnitudes of task gradients. When differences between ∇_θ L_k(·) across different tasks are too large, joint optimization may become difficult. As a result, prior work has focused on mitigating conflict by re-weighting task losses (Kendall et al., 2018; Chen et al., 2018; Sener and Koltun, 2018), by aligning gradient directions (Yu et al., 2020; Wang et al., 2020; Chen et al., 2020), or both (Javaloy and Valera, 2022).
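For intuition on the direction-alignment line of work, the sketch below implements the projection step at the heart of gradient surgery, in the spirit of Yu et al. (2020), applied to flattened per-task gradients; it is a simplified reading of that idea, not the exact published algorithm nor a method used in this paper.

```python
import torch

def project_conflicting(grads):
    """Gradient-surgery-style projection: if g_i conflicts with g_j (negative dot
    product), remove from g_i its component along g_j. Returns the averaged result."""
    projected = [g.clone() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:                                    # directions conflict
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).mean(dim=0)

# Two toy task gradients that point in partially opposing directions.
g1 = torch.tensor([1.0, 1.0])
g2 = torch.tensor([-1.0, 0.5])
print(project_conflicting([g1, g2]))
```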
Despite progress on mitigating task conflict in multi-task optimization, the relationship between conflict and negative transfer is still poorly understood. Quantifying when differences between task gradients are beneficial, and when they are conflicting, is an open problem; prior work on predicting negative transfer between tasks relies on other heuristics (Standley et al., 2020; Fifty et al., 2021; Liu et al., 2022). To that end, we study how framing multi-task learning in a text-to-text paradigm affects the relationship between tasks from the perspective of both task conflict and negative transfer.

Substantial Differences Between Canonical and Text-to-Text MTL
Despite rising interest in text-to-text multi-task learners (Raffel et al., 2020; Sanh et al., 2021), little attention has been given to the effects of text-to-text architectures, and especially how tasks are specified, on multi-task conflict and negative transfer. We aim to quantify the degree to which different aspects of text-to-text architectures may mitigate or exacerbate multi-task conflict, and its impact on transfer. We identify three key factors that distinguish text-to-text MTL from canonical multi-task architectures:
F1 The head for each task is an auto-regressive language model, often with fewer constraints than task-specific heads. All tasks use the same language-modeling loss function.
F2 There are no task-specific parameters ϕ_k. Instead, all parameters θ are shared across all tasks, and the only gradient necessary for learning is ∇^MT_θ.
F3 Finally, the MTL objective depends on p_k and r_k(·), fixed natural language sequences which may or may not encode semantic properties of task k and its output space.

Factors 1 & 2
Models We use T5 (Raffel et al., 2020), pre-trained on the C4 language modeling task, as the base of our models. A pre-trained T5 model can provide rich representations of inputs which can be passed to canonical MTL heads (Ni et al., 2022) or a language model decoder for text-to-text learning.
Our canonical models simply attach task-specific canonical heads to the outputs of a shared T5 encoder, and jointly fine-tune all task-specific and shared parameters. For more details on the canonical head and loss for each task, see Appendix A.
To move from canonical models to models exhibiting Factor 1, we replace each task-specific head with a pre-trained T5 decoder head; that is, for all tasks k we replace h_{ϕ_k} : R^d → Y_k with j_{ϕ_k} : R^d → X. Note that each decoder is still parameterized with task-specific parameters ϕ_k, and therefore each task is still specified by which decoder is used at test time. We refer to this architecture as Text-to-Text-ID: models which treat all tasks as text-to-text problems, but which still leverage independent heads for task specification.
Factor 2 requires a unified output space for each task in order to remove task-specific parameters. To test the effects of Factor 2 on MTL, we start from Factor 1 models (Text-to-Text-ID) and combine all task heads into a single decoder head shared across all tasks, j_{θ_d}. This yields a fully unified Text-to-Text architecture, equivalent to the original T5 MTL setting (Raffel et al., 2020). Factor 2 removes task-specific parameters from the model, and therefore we must specify tasks at the input. For the following experiments we utilize the default prompts and output spaces listed in Appendix B; in §5 we explore how Factor 3 (different task prompts and output spaces) affects our results.
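A minimal sketch of this fully shared Text-to-Text setup, written against the Hugging Face transformers API (the model name, prompts, and hyperparameters here are illustrative, not our exact configuration): every task contributes prompted string pairs, and a single language-modeling loss covers all of them.

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Every task contributes (prompted input, verbalized output) string pairs.
batch = [
    ("sst2 sentence: a gripping, beautifully shot film.", "positive"),
    ("cola sentence: the cat sat mat on.", "unacceptable"),
]

sources = tokenizer([s for s, _ in batch], padding=True, return_tensors="pt")
targets = tokenizer([t for _, t in batch], padding=True, return_tensors="pt")
# Ignore padding positions in the language-modeling loss.
labels = targets.input_ids.masked_fill(targets.input_ids == tokenizer.pad_token_id, -100)

# One optimization step: a single sequence-to-sequence loss shared by all tasks (Factors 1 & 2).
outputs = model(input_ids=sources.input_ids,
                attention_mask=sources.attention_mask,
                labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```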

Datasets
We study negative transfer and multi-task conflict on two common NLP multi-task benchmarks. The first is GLUE (Wang et al., 2018), a set of 8 natural language understanding tasks. The second is DecaNLP (McCann et al., 2018), which frames a diverse set of NLP tasks as question answering; the tasks we use are described in Appendix B.2. We compare single- and multi-task models, over both canonical and text-to-text settings, in Table 1.
Text-to-Text-ID models have independent heads, representing Factor 1, and Text-to-Text multi-task models represent Factors 1 & 2. For both GLUE and DecaNLP models, canonical and text-to-text models perform comparably on most tasks. For GLUE, text-to-text models perform marginally below canonical models, on average, notably due to STS-B, a regression task which suffers from the usage of a fixed vocabulary for numeric outputs. For DecaNLP, text-to-text models slightly outperform canonical models: 7 out of the 9 tasks are text-to-text or span-labeling tasks, the latter of which may be more amenable to text-to-text learning.
Our results illustrate that both text-to-text and canonical models exhibit similar amounts of negative and positive transfer. On GLUE, MTL reduces performance by nearly 2.5% on average in both text-to-text and canonical settings. In DecaNLP, MTL generally exhibits positive transfer, a trend also noted in the original paper (McCann et al., 2018), suggesting that these tasks benefit each other. However, canonical and text-to-text models again exhibit similar levels of positive transfer (an 8-10% increase in performance). Moreover, the similarities continue at the individual task level: tasks which experience positive or negative transfer in the canonical models often experience the same amount of positive or negative transfer in text-to-text models.
We additionally see that Text-to-Text-ID models, which exhibit Factor 1, are generally worse than Text-to-Text models, which combine Factors 1 and 2. In tasks where this difference is significant, such as STS-B or CoLA, we identify this as an issue with the pre-trained decoder (Appendix C); re-initializing the decoders boosts the strength of STS-B and CoLA, suggesting an incompatibility between language model pre-training and regression-like tasks. For other tasks, the effect of Factor 2 is less significant: tasks such as translation appear to benefit slightly from task-specific decoders, whereas other tasks, e.g. span-labeling tasks, benefit from a shared decoder. On average, Text-to-Text models outperform Text-to-Text-ID models significantly, suggesting that the benefits of Factor 2 for some tasks outweigh the costs to other tasks in terms of performance, on the settings we consider. In short, when framing multi-task learning as a text-to-text problem (Factor 1), it is typically better to share the decoder (Factor 2) as well.
Despite what could be considered radically different approaches to MTL, our findings suggest that Factors 1 and 2 may not have a significant effect on the optimization landscape of multi-task learning. Factors 1 and 2 combined frame MTL as a multi-domain problem, where tasks represent distributions over the same problem space (X → X). Our results suggest that this re-framing is not sufficient, alone, to mitigate negative transfer.

Analyzing Multi-Task Conflict
Our results in §3.1 suggest that negative transfer is not impacted by framing MTL as a text-to-text problem. We next empirically examine whether this is borne out via multi-task conflict. Multi-task conflict is typically thought to be a key factor in negative transfer (§2.3). Multi-task conflict can be summarized as the differences between individual task gradients; while some amount of conflict is necessary for multi-task learning to benefit over single-task learning, too much conflict can create optimization difficulties. Given that we see equivalent amounts of negative and positive transfer across canonical and text-to-text models, we expect to see similar levels of conflict as well.
Measuring Conflict We measure two types of conflict in this work: magnitude and direction. Let Θ be all the parameters of a model, ∇_θ L_k(·) be the gradient of θ for task k with parameters Θ, and ∇^MT_θ be the multi-task gradient of θ with total parameters Θ (as in §2). We measure magnitude conflict, C_mg, as the variance across each task's gradient magnitude:

C_mg = (1/K) Σ_k ( ||∇_θ L_k(Θ)|| − N_MT )²,  where  N_MT = (1/K) Σ_k ||∇_θ L_k(Θ)||.

A high value of C_mg indicates high conflict between the magnitudes of task gradients, which may lead to optimization being dominated by a few high-magnitude tasks. C_mg is highly correlated with the magnitude of the loss at each step and the average gradient magnitude. To control for text-to-text models having training losses several orders of magnitude higher than classification losses at the beginning of training, we normalize the variance of gradient norms C_mg by the average gradient norm squared, N²_MT, at each step. Thus, when comparing gradient conflict across different models, the resulting metric measures the gradient variance each model would have if all models had the same average gradient norm.
We measure directional conflict, C_dir, as the norm of the multi-task gradient after normalizing each task gradient ∇_θ L_k(Θ) to magnitude 1:

C_dir = || (1/K) Σ_k ∇_θ L_k(Θ) / ||∇_θ L_k(Θ)|| ||.

This metric is equivalent, up to a constant, to the average pairwise cosine similarity between all task gradients, which is a standard measurement for directional conflict in multi-task optimization (Yu et al., 2020; Wang et al., 2020).
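Both metrics can be computed from flattened per-task gradients of the shared parameters; the sketch below follows the definitions above, though our experimental implementation may differ in details such as how gradients are collected and averaged over steps.

```python
import torch

def conflict_metrics(task_grads):
    """task_grads: list of flattened gradient tensors, one per task (same shape)."""
    norms = torch.stack([g.norm() for g in task_grads])
    n_mt = norms.mean()                               # average gradient norm N_MT
    c_mg = norms.var(unbiased=False) / (n_mt ** 2)    # magnitude conflict, normalized by N_MT^2
    units = torch.stack([g / (g.norm() + 1e-12) for g in task_grads])
    c_dir = units.mean(dim=0).norm()                  # norm of the averaged unit gradients
    return c_mg.item(), c_dir.item()

# Toy example: two well-aligned task gradients and one opposing gradient.
grads = [torch.tensor([1.0, 0.0]), torch.tensor([0.9, 0.1]), torch.tensor([-1.0, 0.0])]
print(conflict_metrics(grads))
```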
Comparing conflict Our goal is to compare how task conflict changes across canonical and text-to-text models, and specifically, how Factors 1 and 2 affect task conflict. To do this, we compare C_dir and C_mg in Canonical, Text-to-Text-ID, and Text-to-Text models; in our comparisons across these architectures we measure conflict in the encoder only, which ensures that we are measuring task conflict across not only the same number of parameters, but over the same architecture as well (a T5 encoder). Finally, we note that conflict in the encoder is largely consistent between Text-to-Text and Text-to-Text-ID architectures. Therefore, we conjecture that, if the benefit of moving from independent heads to a single joint head is related to multi-task conflict, it is largely due to conflict that occurs in the decoder parameters.

Results
We plot C_mg and C_dir across the training trajectories of canonical, Text-to-Text-ID, and Text-to-Text models in Figure 2. We see that directional conflict remains similar across all settings. Magnitude conflict, in contrast, does not track transfer: canonical models have lower magnitude conflict in GLUE and higher magnitude conflict in DecaNLP than text-to-text models, yet both benchmarks exhibit similar transfer across canonical and text-to-text models.
Our findings suggest that prior work in multi-task learning which studies canonical models, especially work focused on multi-task conflict and negative transfer (§2.3), may still be relevant in text-to-text MTL. We observe that shifting from canonical to text-to-text MTL has little effect on both negative transfer and multi-task conflict. The fully-shared parameterization of text-to-text models does not translate into reductions in task conflict, nor major improvements in positive transfer between tasks. Future work on improving multi-task optimization by addressing conflict in the text-to-text setting is therefore a promising avenue.

Factor 3: Task Prompts & Labels
We have seen that Factors 1 & 2 have little effect on transfer and task conflict, in our settings. We now turn to Factor 3: the form of p_k and r_k. In text-to-text MTL, tasks are specified via natural language prompts which can range in semantic richness from nondescript tokens, such as "SST: ", to descriptions of the task: "Is the sentiment of this review positive or negative?: ". In theory, a T2T multi-task model may leverage the semantics of task descriptions to implicitly align related tasks during learning, especially when leveraging a pre-trained model which exhibits non-trivial prompt understanding. However, Logan et al. (2021) demonstrated that T5 could be fine-tuned with null prompts and retain competitive accuracy, suggesting that models prefer to memorize prompts regardless of their semantic information. If T2T multi-task models memorize task prompts, the form of p_k and r_k should have no impact on task conflict.

Leveraging Diverse Prompts
Recently, Sanh et al. (2021) proposed T0, a T5 architecture fine-tuned in a massively multi-task manner with a diverse set of crowd-sourced prompts for each task: specifically, they leverage crowd workers to generate multiple distinct, semantically informative p_k and r_k for each task. Sanh et al. (2021) find that a model trained with these diverse, descriptive prompts has significantly stronger zero-shot capabilities than models trained on one prompt per task, suggesting T0 has learned to leverage semantic information in task prompts. This result has been called into question by Webson and Pavlick (2021), who demonstrate that few-shot performance, even with T0, can still be competitive when using nonsense prompts; however, the strength of T0's zero-shot performance suggests that diversifying the task prompts during learning may improve the model's understanding of task descriptions to some extent: does this correlate with less negative transfer when fine-tuning?
To test this, we leverage promptsource, a repository of diverse prompts for a large set of NLP tasks for text-to-text learning (Sanh et al., 2021). For each task k in GLUE, promptsource contains 4-7 pairs of prompts and text-labels (p_k, r_k). We re-train our T2T multi-task GLUE models using a randomly sampled pair (p_k, r_k) for each sample in task k. We term this model the diverse prompt model, and compare its multi-task performance to the default prompt model, a model which uses task-specific tokens, similar to those in the original T5 approach, which encode no semantic information. Additionally, we train a text-to-text multi-task model with null prompts, i.e. no prompts. This model must learn which task it is performing from the distribution of the input samples alone.
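The only difference between the null, default, and diverse settings is how each training example is rendered into a (source, target) pair; a minimal sketch follows, with invented templates standing in for promptsource entries.

```python
import random

# Illustrative prompt/verbalizer pools; real templates come from promptsource.
PROMPT_POOL = {
    "sst2": [
        ("sst2 sentence: ", {0: "negative", 1: "positive"}),   # default-style fixed token
        ("Is the sentiment of this review positive or negative? ", {0: "negative", 1: "positive"}),
        ("Does the following review sound happy or unhappy? ", {0: "unhappy", 1: "happy"}),
    ],
}

def format_example(task, x, y, mode="default"):
    """default: always the first template; diverse: sample a template per example;
    null: no prompt, raw input with the default verbalizer."""
    if mode == "null":
        _, verbalizer = PROMPT_POOL[task][0]
        return x, verbalizer[y]
    idx = random.randrange(len(PROMPT_POOL[task])) if mode == "diverse" else 0
    prompt, verbalizer = PROMPT_POOL[task][idx]
    return prompt + x, verbalizer[y]

print(format_example("sst2", "a tedious, overlong slog.", 0, mode="diverse"))
```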
We plot the multi-task results of null, default, and diverse T2T models in Table 2, as well as single-task text-to-text performance; we plot the multi-task conflict of models trained with the 3 different prompts in Figure 3. The cells of Table 2 are highlighted by their magnitude of positive (green) or negative (red) transfer from single-task T2T models, which are trained with null prompts. We find that null prompts perform very poorly in MTL, suggesting that T2T models struggle to determine task specification based on the input text distribution alone. Default prompts, used by the original T5 paper, which are fixed tokens pre-pended to each input, perform much better than null prompts. This result demonstrates the importance of specifying the task at the input during text-to-text MTL; while single-task models can be trained with null, and even nonsensical, prompts to competitive accuracy (Logan et al., 2021), MTL requires the text distribution to explicitly differentiate the tasks.
We additionally find that diverse prompts increase negative transfer compared to default prompts; either understanding task descriptions is not beneficial for multi-task optimization or, more likely, diversifying task prompts is not sufficient to ensure a model leverages the information in p_k during training. We note that a limitation of our work is the number of tasks we consider; it is possible that, under an MTL regime with 9 tasks, even a diverse set of prompts is not enough to improve prompt understanding. As a result, it may be that introducing multiple prompts per task serves only to confuse the model as to how tasks are specified, lowering test performance.
Finally, in Figure 3 we see that, once again, directional conflict is largely similar across training paradigms. Null prompt models see slightly less directional conflict towards the end of training, potentially due to the fact that, without task specifications, task distributions overlap; however, lower directional conflict is not correlated with stronger task performance in this case. Additionally, diverse prompts appear to have consistently lower magnitude conflict throughout training; again, lower conflict is not indicative of a stronger multi-task model, as default prompt models perform the best out of the 3 specifications.
5.2 Overlap in the Output Space
§5.1 suggests that, in the settings we study, the semantics of task specifications have little effect on MTL. To that end, we test a much simpler artifact of Factor 3: overlap across the output spaces of classification tasks. If the semantics of r_k are not important to multi-task performance, then hypothetically the selection of terms for class labels should not matter (e.g. using the letter "z" versus the term "entailment"). However, one factor that may impact multi-task learning is the amount of overlap in class label terms across different tasks; is it easier to learn multiple classification tasks if their label terms are distinct? To study this, we create 2 additional r_k for each classification task in GLUE: the first is non-overlapping characters, where we sample a random character for each class label, ensuring that no character is used twice across all tasks. The second is overlapping characters, where we instead sample a few characters which are reused for class labels across all tasks. Finally, we consider the default terms for class labels, which are semantically meaningful, multi-token terms.
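The three output spaces can be viewed as alternative verbalizers r_k; the character assignments below are illustrative of the construction rather than the exact characters sampled in our runs.

```python
# Three alternative verbalizers r_k for two binary GLUE tasks (illustrative characters).
DEFAULT_TERMS = {
    "sst2": {0: "negative", 1: "positive"},
    "qqp":  {0: "not_duplicate", 1: "duplicate"},
}
NON_OVERLAPPING = {   # every class label is a distinct character across all tasks
    "sst2": {0: "a", 1: "b"},
    "qqp":  {0: "c", 1: "d"},
}
OVERLAPPING = {       # a small set of characters reused by every task
    "sst2": {0: "x", 1: "y"},
    "qqp":  {0: "x", 1: "y"},
}

def render_target(label_maps, task, y):
    return label_maps[task][y]

print(render_target(NON_OVERLAPPING, "qqp", 1))   # 'd'
print(render_target(OVERLAPPING, "qqp", 1))       # 'y'
```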
We plot both conflict in the decoder of the model and average classification task accuracy in Figure 4. We find that, similar to other factors that we study in this work, different output spaces have no observable impact on task conflict (magnitude conflict exhibits the same trend). However, we do observe impacts on model accuracy: models with less overlap in their output space tend to perform better than models with more overlap, suggesting it is harmful to re-use class labels across different tasks (although we emphasize that this may only be the case when the semantics of the class tokens are not important, i.e. in the few-task regime).
Overall, we find that it is important to specify tasks in T2T MTL: expecting the model to learn task specification from the input distribution alone leads to high negative transfer. However, when the intent is to create a strong multi-task model via fine-tuning, these prompts and outputs need not be semantically rich: simple task-specific tokens often suffice. Indeed, when MTL is being done in a setting that is not massively multi-task (Sanh et al., 2021; Aribandi et al., 2022), rich and diverse prompts may actually hurt multi-task learning, rather than benefit it. Additionally, throughout this work, our findings have highlighted the disconnect between standard notions of task conflict and negative transfer: task conflict, particularly across gradient magnitudes, is rarely predictive of negative transfer in the settings we study.

Discussion
Conclusions General-purpose text-to-text architectures, typically pre-trained on large amounts of data, are increasingly popular both as a topic of academic study and in commercial NLP applications. By using natural language prompts rather than task-specific parameters, models like T5 can encode multiple tasks in the same framework with fully shared parameters. Empirically, text-to-text models often achieve comparable or superior performance compared to task-specific models. Surprisingly, we find similar task conflict in both canonical and text-to-text models, with similar negative transfer. This raises the question: are text-to-text models fundamentally better multi-task learners than canonical MTL models? Or is their success largely attributable to other factors, such as increased model capacity and pre-training on more data? Our findings suggest the latter: text-to-text architectures are not inherently better multi-task learners than canonical models. In particular, when controlling for other factors, we have found that T5 exhibits comparable transfer characteristics to canonical MTL parameterizations, both in terms of performance on held-out data and, over the course of optimization, in measured rates of gradient conflict.
Future directions Recently, a series of optimization methods have been proposed to address the challenges posed by multi-task, or more generally, multi-objective learning (Wang et al., 2020; Javaloy and Valera, 2022). If the optimization landscape of multi-task text-to-text models poses similar challenges as canonical MTL parameterizations, applying special-purpose MTL optimization methods to text-to-text models may result in faster convergence, improvements in performance, or both. However, more research is needed both to understand the connection between task conflict and negative transfer, and to develop new optimization techniques better suited to the fully-shared parameterization used by models like T5. As such, our results suggest several promising avenues for future work, which to our knowledge have lacked clear motivation until now.

Limitations
A foremost limitation of our work is that we consider only a single language, English, largely due to the availability of both strong text-to-text models for English and several English multi-task benchmarks. While we expect that our findings regarding multi-task learning are generalizable across languages, further experiments involving multi-task benchmarks and pre-trained encoders in other languages are necessary to fully support this claim.
Our work is additionally limited by the multi-task settings we examine. While we study two very popular NLP MTL benchmarks, GLUE and DecaNLP, both have limitations and prior work has shown wide variance in how well MTL works on different datasets. Notably, the canonical output space of all tasks is fairly limited, e.g. our work overall considers regression, classification, span-labeling, and sequence output spaces. However, other classical NLP output spaces are also worth studying, such as constrained structured prediction in named entity recognition or syntactic parsing. Finally, our MTL settings consider only 9 tasks to be learned jointly. A smaller universe of tasks enables us to fairly compare single- and multi-task models across several different architectures. Recent work has begun to consider massively multi-task models on 100+ tasks simultaneously (Aribandi et al., 2021). While we expect that our findings will generalize to these extreme settings, expanding our experiments to larger universes of tasks would validate this claim.

B.1 GLUE
CoLA is a dataset of sentences annotated for whether or not each sentence is grammatical English. In GLUE, this task is treated as a binary classification task, with sentences classified as either acceptable or unacceptable. Our default prompt for this task, following T5, is p_k = "cola sentence:" and r_k = "acceptable" for acceptable sentences or "unacceptable" otherwise. The evaluation metric for this task is the Matthews correlation coefficient.
SST is a dataset of movie reviews and human annotations of their sentiment (Socher et al., 2013). The dataset is treated as a binary classification dataset, with sentences classified as either positive or negative. The base prompt p_k = "sst2 sentence:" and r_k = "positive" if the class is positive or "negative" if the class is negative. The evaluation metric for this task is accuracy.
MRPC is a dataset of sentence pairs annotated for semantic equivalence (Dolan and Brockett, 2005); in GLUE, this task is framed as a binary classification task. The base prompt p_k = "mrpc sentence 1: . . . sentence 2: . . ." and r_k = "not_equivalent" if the sentences are not equivalent, and "equivalent" otherwise. The evaluation metric for this task is accuracy.
STS-B is a collection of sentence pairs annotated with a similarity score from 1 to 5 (Cer et al., 2017). This task is a regression task: a model is tasked with predicting the similarity score of the two input sentences. The base prompt p_k = "STS sentence 1: . . . sentence 2: . . .". r_k simply rounds the score of each input to 1 decimal place. The evaluation metric for this task is the Spearman correlation coefficient.
QQP is a dataset of question pairs from the website Quora, annotated for whether or not the questions are semantically equivalent. The task is a binary classification task. The base prompt p_k = "qqp question 1: . . . question 2: . . ." and r_k is "not_duplicate" for non-equivalent questions, and "duplicate" otherwise. The evaluation metric for this task is accuracy.
MNLI & MNLI-mm is a dataset of sentence pairs with textual entailment annotations (Williams et al., 2018). The task is to determine whether the second sentence (the hypothesis) is entailed, contradicted, or neither by the first sentence (the premise). The task is a 3-class classification task. The base prompt p_k = "MNLI premise: . . . hypothesis: . . .", and r_k is either "entailment", "neutral", or "contradiction". The evaluation metric for this task is accuracy. The task comes with an additional mismatched test set, MNLI-mm, which contains inputs from different domains than the training domain.
QNLI is an entailment dataset constructed from SQUAD (Rajpurkar et al., 2016) by generating question-sentence pairs from a question and each sentence in its corresponding context. The task is to determine whether the context sentence contains the answer to the question, making the task a binary classification task. Our base prompt p_k = "qnli question: . . . sentence: . . .", and r_k is "entailment" if the context contains the answer, and "not entailment" otherwise. The evaluation metric for this task is accuracy.
RTE is composed of a series of textual entailment datasets constructed from news and Wikipedia text. The task is to determine whether sentence 2 is entailed by sentence 1, making the task a binary classification task. The base prompt p_k = "rte sentence 1: . . . sentence 2: . . ." and r_k is "entailment" if sentence 2 is entailed and "not entailment" otherwise. The evaluation metric for this task is accuracy.
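For reference, the default GLUE prompts and verbalizers described above can be collected in one place as follows; the ellipses mark where task inputs are inserted, the class-index assignments are illustrative, and exact whitespace in our configuration may differ.

```python
# Default GLUE prompts p_k and label verbalizers r_k, as described in this appendix.
# The "..." placeholders mark where the input sentence(s) are inserted; integer class
# indices are illustrative.
GLUE_DEFAULTS = {
    "cola": ("cola sentence: ...", {0: "unacceptable", 1: "acceptable"}),
    "sst2": ("sst2 sentence: ...", {0: "negative", 1: "positive"}),
    "mrpc": ("mrpc sentence 1: ... sentence 2: ...", {0: "not_equivalent", 1: "equivalent"}),
    "stsb": ("STS sentence 1: ... sentence 2: ...", "similarity score rounded to 1 decimal place"),
    "qqp":  ("qqp question 1: ... question 2: ...", {0: "not_duplicate", 1: "duplicate"}),
    "mnli": ("MNLI premise: ... hypothesis: ...", {0: "entailment", 1: "neutral", 2: "contradiction"}),
    "qnli": ("qnli question: ... sentence: ...", {0: "entailment", 1: "not entailment"}),
    "rte":  ("rte sentence 1: ... sentence 2: ...", {0: "entailment", 1: "not entailment"}),
}
```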

B.2 DecaNLP
The NLP Decathlon (DecaNLP; McCann et al., 2018) frames 9 NLP tasks as question-answering problems, and proposes a novel end-to-end Q&A architecture (MQAN) to solve them. We adopt 8 of the 9 tasks for our work, which we list and briefly describe below. Unlike GLUE, DecaNLP's framing of all tasks as questions results in some tasks naturally being "self-specified" in the text-to-text framework; for example, in SQUAD, the "question" portion of the input specifies to the model which task it should perform. Thus, no prompt p_k is necessary for SQUAD: the question specifies the task.
SST is the task of classifying whether or not a given sentence has a positive sentiment (Socher et al., 2013), and also appears in GLUE. This task is a binary classification task. In DecaNLP the base prompt p_k is ". . . Is this review negative or positive?", and r_k is "positive" for positive reviews and "negative" otherwise. The evaluation metric for this task is accuracy.
MNLI is a dataset of sentence pairs with textual entailment annotations (Williams et al., 2018), and also appears in GLUE. The task is to determine whether the second sentence (the hypothesis) is entailed, contradicted, or neither by the first sentence (the premise). The task is a 3-class classification task. The base prompt p_k = "Context: . . . Premise: . . . - entailment, neutral, or contradiction?", and r_k is either "entailment", "neutral", or "contradiction". The evaluation metric for this task is accuracy.
IWSLT is a machine translation dataset; we leverage specifically the English-to-German IWSLT 2016 task (Cettolo et al., 2016). This task is a sequence-to-sequence task. Our base prompt is p_k = ". . . Translate this sentence into german", and r_k is the identity map. The evaluation metric for this task is BLEU.
CNN / Dailymail is a summarization dataset of news articles (Nallapati et al., 2016). The task is to generate an abstractive summary of a given input, making this a sequence-to-sequence task. Our base prompt p_k = ". . . What is the summary?", and r_k is the identity map. The evaluation metric for this task is ROUGE.
Seq2SQL is a semantic parsing dataset, where the task is to convert a natural language request into a SQL query given some additional dataset information (Zhong et al., 2018). This is a sequence-to-sequence task. Our base prompt p_k = ". . . What is the translation from English to SQL?", and r_k is the identity map. The evaluation metric for this task is Logical EM, where query outputs are converted back into database logic and evaluated at that level.
SQUAD is an extractive question answering dataset (Rajpurkar et al., 2016). For each example in SQUAD, a context and question are given; the question is over the given context, and the answer to the question exists as a span inside the context. Thus, this task is a span-labeling task. The task is specified by the question, so we use no prompt for this task. r_k is simply a conversion of the span start and end points into the natural language sentence that they cover. The evaluation metric for this task is nF1.
QA-SRL is a semantic role labeling dataset; semantic role labeling is the task of assigning semantic roles, such as agent, goal, or result, to constituents of a sentence. QA-SRL (He et al., 2015) frames this task as a question-answering problem, where the question specifies a semantic role and asks which constituent fulfills that role, and the answer is a span of the text. Thus, this task is a span-labeling task. The task is specified by the question, so we use no prompt for this task. r_k is simply a conversion of the span start and end points into the natural language sentence that they cover. The evaluation metric for this task is nF1.
QA-ZRE is a relation extraction dataset; relation extraction is the task of extracting relationships between one or more entities of a sentence. Similar to QA-SRL, QA-ZRE (Levy et al., 2017) formulates this problem as a question-answer dataset. Here the question specifies a relation, and perhaps an entity, and then asks what other entity in the sentence matches this relationship. The answer can either be a span from the sentence, or that no entity fulfills that relationship. Thus, this task is a span-labeling task. The task is specified by the question, so we use no prompt for this task. r_k is simply a conversion of the span start and end points into the natural language sentence that they cover, or "unanswerable" if no entity fulfills the relationship. The evaluation metric for this task is corpus-level F1.
Wino is a pronoun resolution dataset (Levesque et al., 2012), with the task being to identify which entity is being referred to by a specific pronoun in the sentence. In DecaNLP this dataset is framed as a Q&A task, where the question simply asks which entity in the context is being referred to by the ambiguous pronoun. In the canonical setting this can be done by labeling the entity in the text which is being referred to, yielding a span-labeling task. The task is specified by the question, so we use no prompt for this task. r_k is simply a conversion of the span start and end points into the natural language sentence that they cover, or "unanswerable" if no entity fulfills the relationship. The evaluation metric for this task is EM accuracy.

C Re-Initializing Sequence Decoders
When studying the effects of Factor 1 in §3, we leverage pre-trained T5 decoders as the replacement for each task-specific head. Using T5 decoders as task-specific heads moves all task output spaces into X, and ensures that all tasks can be trained with the same loss function (L_LM). However, pre-training is not necessary for this shift, and could be considered a confounding factor in moving from canonical to text-to-text models, as most canonical heads do not benefit from pre-training.

Figure 1: Canonical (left) vs. Text-to-Text (right) Multi-Task (T5) Architectures. In this work we are interested in how negative transfer and multi-task conflict are affected by moving from Canonical MTL to Text-to-Text MTL.

Figure 2: Magnitude conflict (C_mg, top, normalized by average gradient magnitude) and directional conflict (C_dir, bottom) across GLUE (left) and DecaNLP (right) tasks in canonical, text-to-text, and Text-to-Text-ID models. Directional conflict in the encoder follows similar trajectories across different architectures, initially noisy and eventually stabilizing to similar values; however, magnitude conflict differs across settings, and is likely tied to the homogeneity of canonical task loss functions.

Figure 3: Magnitude conflict (C_mg, top) and directional conflict (C_dir, bottom) in GLUE models using null, default, and diverse prompts. We find that increasing prompt diversity generally lowers task conflict, suggesting that it pushes task distributions closer together.

Figure 4: (Top): Directional conflict (C_dir) in the decoder of GLUE models when using different classification output spaces. The output space used, and the amount of overlap across tasks, has no observable effect on multi-task conflict. (Bottom): The impact of classification output spaces on multi-task performance. Overlap in the output space appears to have a slight effect on model accuracy: output spaces with less overlap result in higher accuracy.

Table 1: Negative transfer between text-to-text and canonical models. Results are averaged over 3 random seeds. Multi-task performance is highlighted in red if it is lower than single-task performance (negative transfer) and in green if it is higher (positive transfer).

Table 2: Multi-task model performance for GLUE models trained on null, default, and diverse prompts. All results are averaged over 3 random seeds. While diverse prompts may help with zero-shot transfer (Sanh et al., 2021), they do not necessarily improve negative transfer in text-to-text multi-task models.