FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue

Task transfer, transferring knowledge contained in related tasks, holds the promise of reducing the quantity of labeled data required to fine-tune language models. Dialogue understanding encompasses many diverse tasks, yet task transfer has not been thoroughly studied in conversational AI. This work explores conversational task transfer by introducing FETA: a benchmark for FEw-sample TAsk transfer in open-domain dialogue. FETA contains two underlying sets of conversations upon which there are 10 and 7 tasks annotated, enabling the study of intra-dataset task transfer, i.e., task transfer without domain adaptation. We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs and create a baseline for future work. We run experiments in the single- and multi-source settings and report valuable findings, e.g., most performance trends are model-specific, and span extraction and multiple-choice tasks benefit the most from task transfer. In addition to task transfer, FETA can be a valuable resource for future research into the efficiency and generalizability of pre-training datasets and model architectures, as well as for learning settings such as continual and multitask learning.


Introduction
Improving sample efficiency through transfer learning has been a long-standing challenge in the machine learning and natural language processing communities (Pratt et al., 1991; Ando and Zhang, 2005). Dialogue data requires multiple cohesive turns with consistent speaker personalities (Urbanek et al., 2019; Huang et al., 2020), creating a challenge for data collection and motivating the development of techniques that improve sample efficiency in conversational AI (Lin et al., 2020). Furthermore, dialogue understanding tasks require a shared knowledge of semantics, pragmatics, human behavior, and commonsense, making dialogue an area of study that can benefit greatly from a deeper understanding of transfer learning.
Two essential transfer learning settings, namely domain adaptation and task transfer, have been studied on language tasks (Ruder et al., 2019). While domain adaptation has been studied in task-oriented dialogue (Mehri et al., 2020), task transfer has been studied with less rigor in conversational AI. Prior studies of task transfer in dialogue consider only 2-4 tasks, focus on multitask learning, and do not compare learning algorithms (Hosseini-Asl et al., 2020; Peng et al., 2021b).
Prior studies have focused on cross-dataset task transfer, gathering tasks annotated on disjoint datasets (Vu et al., 2020; Ye et al., 2021), but this can lead to improvements in domain adaptation being confounded with improvements in task transfer. A precise study of task transfer should be conducted on a single data source in an intra-dataset transfer setting, as in Zamir et al. (2018). Additionally, previous studies focus on learning algorithms and use only a single language model architecture (Pruksachatkun et al., 2020; Lourie et al., 2021; Aribandi et al., 2022), which may lead to a narrow understanding. To the best of our knowledge, this is the first rigorous study on task transfer in dialogue and the most extensive intra-dataset task transfer study in NLP.
In this work, we create FETA, a benchmark for few-sample task transfer for language understanding in open-domain dialogue with 17 total tasks. FETA datasets cover a variety of properties (dyadic vs. multi-party, anonymized vs. recurring speaker, varying dialogue lengths) and task types (utterance-level classification, dialogue-level classification, span extraction, multiple-choice), and maintain a wide variety of data quantities.
We study task transfer on FETA by comparing three task transfer algorithms and three commonly used language models in single-source and multi-source settings. Figure 1 illustrates some results in the single-source setting. For example, we find that Dialogue Reasoning Span Extraction benefits from nearly all source tasks. On the other hand, Adversarial Response Selection and Emotion Recognition improve the performance of many target tasks when utilized as a source task.
In this study, we find that: (i) Trends are largely model-dependent, a finding that previous works have not discussed. (ii) Out of all task types, span extraction tasks gain the most as a target, especially with few samples. (iii) Adding source tasks does not uniformly improve over a single source task, motivating a better understanding of the complex relationship between source and target tasks.
FETA provides a resource for various future studies, e.g., on the generalizability of model architectures, and pre-training datasets that enable efficient transfer. In addition to task transfer, FETA can also facilitate the study of continual and multitask learning.
In summary, our main contributions are:
• We create the first large-scale benchmark for task transfer in dialogue, with 132 source-target task pairs.
• We run extensive experiments on FETA in both the single-source and multi-source settings, and provide an in-depth analysis comparing models, learning algorithms, sample sizes, and task types, finding new and non-intuitive results.
• We provide a readily extensible transfer learning framework that allows for rapid experimentation and an online leaderboard to encourage deeper research into task transfer.
More recently, DialoGLUE (Mehri et al., 2020) and RADDLE (Peng et al., 2021a) study domain adaptation for language understanding tasks in task-oriented dialogue.
Intra-dataset Task Transfer Intra-dataset task transfer has been studied in computer vision applications (Zamir et al., 2018; Pal and Balasubramanian, 2019), but to the best of our knowledge it has never been studied in NLP.

FETA
In this section, we briefly define intra-dataset task transfer, the problem setting of FETA. Then, we introduce FETA, our benchmark for few-sample task transfer in open-domain dialogue. Finally, we define the metrics we use to evaluate models and learning algorithms on FETA.

Problem Definitions
Let a dataset be composed of the instance set, $X$, and $n$ task-specific label sets $Y_1, Y_2, \ldots, Y_n$. In FETA, each instance $x \in X$ is a dialogue.
Definition 1 (Domain and Task). A domain $D = \{\mathcal{X}, P(X)\}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$. The marginal probabilities are over the instance set $X$. A task $T = \{Y, f(X)\}$ is composed of a label set $Y$ and a predictive function $f$.
Definition 2 (Learning Algorithm). A learning algorithm, $A$, is a protocol that determines the method by which the instance set $X$ and task-specific label sets $Y_1, Y_2, \ldots, Y_n$ will be used to train a predictive function, $f$.
Definition 3 (Task Transfer). Given a source task $T_S = \{Y_S, f_S(X_S)\}$ and target task $T_T = \{Y_T, f_T(X_T)\}$, task transfer is the use of a learning algorithm, $A$, to improve the learning of $f_T$ by using the knowledge in $T_S$.
In cross-dataset task transfer, where $X_S \neq X_T$, we also have $P(X_S) \neq P(X_T)$ and $D_S \neq D_T$; that is, there is domain shift.
In intra-dataset task transfer, where $X_S = X_T$, there is no domain shift. This enables the study of the learning algorithm's performance on task transfer, isolated from domain adaptation.
We refer the reader to Pan and Yang (2010) and Zhuang et al. (2021) for expanded discussions on transfer learning definitions.
Few-Sample Due to the challenge and cost of collecting and annotating data, many real-world applications of NLP techniques are limited by data quantities. For this reason, we focus on the few-sample setting, defined in FETA as 10% of the original instance set. Out of 10%, 5%, and 1%, 10% was empirically determined to be the smallest percentage that retains labels from all label sets in both the train and development partitions. Given the recent attention focused on NLP applications in low-resource settings (Brown et al., 2020; Bansal et al., 2020; Mukherjee et al., 2021; Ye et al., 2021), we expect research done in such a low-data setting will lead to insights useful for many researchers and practitioners.
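The criterion used to pick the 10% setting (the smallest candidate percentage whose train and development samples still contain every label of every task) can be illustrated with a small check. This is a minimal sketch under assumptions: the function names, the dictionary layout, and the toy labels below are hypothetical and are not taken from the FETA code base.

```python
def retains_all_labels(full_label_sets, sampled_labels):
    """True if, for every task, the sampled annotations still cover all of its labels."""
    return all(
        set(full_label_sets[task]) <= set(sampled_labels.get(task, ()))
        for task in full_label_sets
    )


def smallest_valid_fraction(full_label_sets, samples_by_fraction):
    """Smallest fraction whose train AND dev samples retain all labels, else None."""
    for fraction in sorted(samples_by_fraction):
        train, dev = samples_by_fraction[fraction]
        if retains_all_labels(full_label_sets, train) and retains_all_labels(
            full_label_sets, dev
        ):
            return fraction
    return None


# Toy illustration (invented labels, not FETA data).
full = {"emotion": {"joy", "anger", "neutral"}}
samples = {
    0.01: ({"emotion": ["neutral"]}, {"emotion": ["neutral"]}),
    0.05: ({"emotion": ["joy", "anger", "neutral"]}, {"emotion": ["joy", "neutral"]}),
    0.10: ({"emotion": ["joy", "anger", "neutral"]}, {"emotion": ["anger", "joy", "neutral"]}),
}
chosen = smallest_valid_fraction(full, samples)
print(chosen)
```

In this toy case the 5% sample fails only because the development partition lost one label, which mirrors the requirement that both partitions keep full label coverage.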

FETA Datasets
In this section, we describe the two dialogue sources we use, DailyDialog (Li et al., 2017) and Friends (Chen and Choi, 2016), and the tasks annotated on each source.
We select these datasets because they complement each other in desirable ways. DailyDialog contains 2-speaker dialogues where speakers are anonymized and averages 88 words per dialogue. In contrast, Friends consists of multiparty dialogues (3.6 speakers mean, 15 max) with recurring characters and averages 283 words per dialogue. These differences lead to each set of dialogue instances having different task annotations, giving FETA a wider variety of tasks. For example, DailyDialog tasks include understanding the causes of emotions and commonsense reasoning, while tasks annotated on Friends revolve more around recognizing entities and understanding personalities.
To create FETA versions of each dataset, we first partition the dialogues into 70/15/15% splits for training, validation, and test sets. After splitting, we randomly down-sample the train and development dialogues to 10% of the original quantities. Thus, FETA splits use 7/1.5/15% of the original dialogues. Not every dialogue is annotated for all tasks, allowing some tasks to have more samples than others. Crucially, the data splits are the same for all tasks, preventing data leakage. Table 1 shows an overview of the tasks, samples, and metrics used for each dataset.

FETA-DailyDialog
Li et al. (2017) present the DailyDialog dataset, with chit-chat conversations covering 10 topics including relationships, politics, and work. Many works add annotations on top of these dialogues and FETA utilizes 10 of them. Figure 2 provides an overview of the tasks: emotion recognition, dialogue act classification, topic classification (from DailyDialog (Li et al., 2017)), causal emotion span extraction, causal emotion entailment (from RECCON (Poria et al., 2021)), dialogue-level natural language inference, dialogue reasoning span extraction, dialogue reasoning multiple choice, commonsense relation extraction (from CIDER (Ghosal et al., 2021)), and adversarial response selection (from DailyDialog++ (Sai et al., 2020)). For further details of these tasks, we refer the reader to Appendix A and their original papers.

FETA-Friends
The Friends dialogues come from transcripts of 10 seasons of the TV show of the same name (Chen and Choi, 2016). In addition to dialogue, the transcripts contain situational information such as behaviors and non-verbal information like scene information.
In total, FETA has 7 task annotations on top of the Friends scripts. As illustrated in Figure 2, the incorporated tasks include Emory emotion recognition (from Zahiri and Choi, 2018), reading comprehension (from Ma et al., 2018), character identification (from Chen and Choi, 2016; Zhou and Choi, 2018), question answering (from Yang and Choi, 2019), personality detection (from Jiang et al., 2020), relation extraction (from DialogRE (Yu et al., 2020a)), and MELD emotion recognition (from MELD (Poria et al., 2019)). There are two emotion recognition label sets (Emory and MELD), but they have only 22% overlap in instance sets and have different label spaces. For further details of these tasks, we refer the reader to Appendix A and their original papers.
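The partition-then-down-sample procedure used to build both FETA datasets (70/15/15% splits, then 10% of the train and development dialogues) can be sketched as follows. This is an illustrative sketch only: the function name, the seed, and the use of integer dialogue IDs are assumptions, and the actual FETA splits are fixed by the released data rather than regenerated this way.

```python
import random


def feta_style_split(dialogue_ids, seed=0, keep_percent=10):
    """Split dialogues 70/15/15, then keep `keep_percent`% of train and dev."""
    rng = random.Random(seed)
    ids = list(dialogue_ids)
    rng.shuffle(ids)

    n = len(ids)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_dev = n * 15 // 100
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # the test split is not down-sampled

    train_small = rng.sample(train, len(train) * keep_percent // 100)
    dev_small = rng.sample(dev, len(dev) * keep_percent // 100)
    return train_small, dev_small, test


train_ids, dev_ids, test_ids = feta_style_split(range(1000))
print(len(train_ids), len(dev_ids), len(test_ids))  # 70 15 150
```

Because the split is made once at the dialogue level, every task annotated on a dialogue inherits that dialogue's partition, which is what keeps the splits identical across tasks.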

Evaluation Metrics
To define the metrics, we consider 4 variables: source task $s$, target task $t$, model $f$, and learning algorithm $A$, and we abuse notation slightly to allow for $f_A(s, t)$ to represent a model trained on the source and target tasks using the given learning algorithm. In FETA, we evaluate the performance of a model and learning algorithm with multiple metrics: average and top-1 raw scores, as well as average and top-1 score ∆s.
Average and Top-1 Scores First, we consider the two raw scores: average score and top-1 score. These metrics aim to answer the following questions: how well do a model and algorithm perform across all task pairs, and how well do a model and algorithm perform supposing that we knew the best source task a priori?
We calculate an average score across all source-target task pairs to understand how each model and algorithm performs in the aggregate. Formally, let the score for a single task be computed as:
$$\mathrm{score}(s, t, f, A) = \frac{1}{|M_t|} \sum_{i=1}^{|M_t|} M_{t,i}\big(f_A(s, t)\big)$$
where $M_t$ is the set of metrics associated with task $t$, found in Table 1, and $M_{t,i}(f)$ is the $i$th calculated metric of model $f$ on task $t$. All metrics range from 0 to 100. Then, we calculate the average score as:
$$\text{Average Score}(f, A) = \frac{\sum_{t \in T} \sum_{s \neq t \in T} \mathrm{score}(s, t, f, A)}{|T| \times (|T| - 1)}$$
where $T$ is the set of tasks.
Additionally, we calculate top-1 score to understand how models and algorithms perform if the best source task is known ahead of time. This score is calculated as the maximum score over source tasks averaged over target tasks. The top-1 score does not consider scores less than the baseline, which is a model trained directly on the target task. Denote the baseline algorithm by $A_B$ and the baseline score as $\mathrm{score}(s, t, f, A_B)$. Formally, the top-1 score is calculated as:
$$\text{Top-1 Score}(f, A) = \frac{\sum_{t \in T} \max_{s \neq t \in T} \big(\mathrm{score}(s, t, f, A_B),\ \mathrm{score}(s, t, f, A)\big)}{|T|}$$
Average and Top-1 ∆s In addition to raw scores, we also calculate score differences to measure how much a source task benefits a target task. The average ∆ describes how much benefit the model saw in the aggregate over all source tasks, while the top-1 ∆ considers only the best source. Score ∆s are calculated with respect to the baseline score as:
$$\Delta(s, t, f, A) = \mathrm{score}(s, t, f, A) - \mathrm{score}(s, t, f, A_B)$$
and the average ∆ is calculated as:
$$\text{Average } \Delta(f, A) = \frac{\sum_{t \in T} \sum_{s \neq t \in T} \Delta(s, t, f, A)}{|T| \times (|T| - 1)}$$
Additionally, we calculate the top-1 ∆ as the maximum positive score difference over source tasks averaged over target tasks:
$$\text{Top-1 } \Delta(f, A) = \frac{\sum_{t \in T} \max_{s \neq t \in T} \big(0,\ \Delta(s, t, f, A)\big)}{|T|}$$
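As a worked illustration of the four metrics described in this section, the following sketch computes them for a toy set of three tasks within a single dataset. It is an illustrative reading of the prose definitions, not the benchmark's evaluation code: the function names, the dictionary layout, and all numbers are invented for the example.

```python
def task_score(metric_values):
    """Mean of a task's metrics (each assumed to lie in [0, 100])."""
    return sum(metric_values) / len(metric_values)


def transfer_metrics(scores, baselines):
    """Aggregate transfer metrics for one model and one learning algorithm.

    scores[(s, t)]: score on target t after transfer from source s.
    baselines[t]:   score on target t when trained on t alone.
    """
    tasks = sorted(baselines)
    pairs = [(s, t) for t in tasks for s in tasks if s != t]
    deltas = {(s, t): scores[(s, t)] - baselines[t] for s, t in pairs}

    # Top-1 values never fall below the baseline (score) or below zero (delta).
    best_scores = [
        max([baselines[t]] + [scores[(s, t)] for s in tasks if s != t])
        for t in tasks
    ]
    best_deltas = [
        max([0.0] + [deltas[(s, t)] for s in tasks if s != t]) for t in tasks
    ]
    return {
        "average_score": sum(scores[p] for p in pairs) / len(pairs),
        "top1_score": sum(best_scores) / len(tasks),
        "average_delta": sum(deltas.values()) / len(pairs),
        "top1_delta": sum(best_deltas) / len(tasks),
    }


# Toy numbers (invented): three tasks A, B, C.
baselines = {"A": 50.0, "B": 60.0, "C": 40.0}
scores = {
    ("B", "A"): 52.0, ("C", "A"): 49.0,
    ("A", "B"): 58.0, ("C", "B"): 59.0,
    ("A", "C"): 44.0, ("B", "C"): 42.0,
}
metrics = transfer_metrics(scores, baselines)
print(metrics["top1_score"], metrics["top1_delta"])  # 52.0 2.0
```

Note that target B receives no benefit from any source in this toy example, so it contributes its baseline score to the top-1 score and zero to the top-1 ∆.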

Task Transfer Algorithms
In this work, we consider three commonly used task transfer methods: Pre-train/Fine-tune, Multitask, and Multitask/Fine-tune. We apply these methods with cross-entropy loss to further optimize pretrained language models on FETA.
Pre-train/Fine-tune Commonly used in NLP today, the pre-train/fine-tune algorithm consists of two stages of training (Pratt et al., 1991). First, the model is trained on the source task $T_S$, optimizing Eq 1, followed by a separate stage of training on the target task $T_T$, optimizing Eq 2:
$$\mathcal{L}_S = -\mathbb{E}_{(x, y_s) \sim \{X, Y_S\}} \big[\log p(y_s \mid x)\big] \quad (1)$$
$$\mathcal{L}_T = -\mathbb{E}_{(x, y_t) \sim \{X, Y_T\}} \big[\log p(y_t \mid x)\big] \quad (2)$$
Multitask In this algorithm, there is only a single stage of multitask training (Caruana, 1994). Formally, the training is conducted on both the source and target task by optimizing Eq 3:
$$\mathcal{L}_{S,T} = -\mathbb{E}_{(x, y_s, y_t) \sim \{X, Y_S, Y_T\}} \big[\log p(y_s \mid x) + \log p(y_t \mid x)\big] \quad (3)$$
Multitask/Fine-tune This algorithm combines the previous algorithms in two stages. In the first stage, the source and target task are optimized jointly, as in Eq 3. Then, the second stage trains using only the target task, as in Eq 2.
Even though model selection in multitasking is generally done w.r.t. multiple source and target tasks (Caruana, 1994), we modify the setting to validate a model on a single target task at a time. This allows hyperparameter search and early stopping to be controlled by the desired target task.
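The three algorithms differ only in which tasks contribute to the loss at each training stage, which can be made concrete with a small schedule sketch. The representation below (a list of stages, each a list of task names) and the function names are assumptions for illustration; they are not taken from the FETA framework, and the per-task losses are toy values.

```python
import math


def cross_entropy(probabilities, gold_index):
    """Cross-entropy for one example: negative log-probability of the gold label."""
    return -math.log(probabilities[gold_index])


def training_stages(algorithm, source, target):
    """Return the ordered training stages; each stage lists the tasks optimized jointly."""
    if algorithm == "pretrain_finetune":
        return [[source], [target]]           # source stage, then target stage
    if algorithm == "multitask":
        return [[source, target]]             # one joint stage
    if algorithm == "multitask_finetune":
        return [[source, target], [target]]   # joint stage, then target stage
    raise ValueError(f"unknown algorithm: {algorithm}")


def stage_loss(per_task_losses, stage_tasks):
    """Joint loss of a stage: the sum of the losses of the tasks in that stage."""
    return sum(per_task_losses[task] for task in stage_tasks)


stages = training_stages("multitask_finetune", "S", "T")
toy_losses = {"S": 0.5, "T": 0.25}
print([stage_loss(toy_losses, stage) for stage in stages])  # [0.75, 0.25]
```

In every schedule the final validation signal comes from the target task alone, consistent with validating on a single target task at a time.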

Experiment Setup
To study task transfer on FETA, we run extensive experimentation. We utilize three task transfer algorithms: pre-train/fine-tune, multitask, and multitask/fine-tune, as described in Section 4. To draw broad conclusions about the performance of each learning algorithm, we utilize pretrained language models with three different architectures: encoder-only (BERT) (Devlin et al., 2019), decoder-only (GPT-2) (Radford et al., 2019), and encoder-decoder (T5) (Raffel et al., 2020). Implementation details, including hyperparameters and prompts, can be found in Appendix B.
A complete experiment for a single target task, T , is as follows: First, we directly fine-tune on T to get the baseline score.Then, for each source task, S, we take the model pre-trained on S and fine-tune on T .Next, we jointly train on S and T together.Finally, we fine-tune the jointly trained model on T .
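The experiment procedure for one target task, and the count of 132 source-target pairs, can be sketched by enumerating ordered task pairs within each dataset. The task counts (10 and 7) come from the paper; the placeholder task names and the tuple layout of the plan are assumptions made for this illustration.

```python
from itertools import permutations

TASKS_PER_DATASET = {"DailyDialog": 10, "Friends": 7}
ALGORITHMS = ("pre-train/fine-tune", "multitask", "multitask/fine-tune")


def source_target_pairs(task_names):
    """All ordered (source, target) pairs with source != target, within one dataset."""
    return list(permutations(task_names, 2))


def experiment_plan(target, task_names):
    """Runs for one target: a baseline, then every algorithm for every other source."""
    plan = [("baseline", None, target)]
    for source in task_names:
        if source == target:
            continue
        for algorithm in ALGORITHMS:
            plan.append((algorithm, source, target))
    return plan


total_pairs = 0
for dataset, n_tasks in TASKS_PER_DATASET.items():
    names = [f"{dataset}-task{i}" for i in range(n_tasks)]  # placeholder names
    total_pairs += len(source_target_pairs(names))

print(total_pairs)  # 10*9 + 7*6 = 132

daily_names = [f"DailyDialog-task{i}" for i in range(10)]
plan = experiment_plan("DailyDialog-task0", daily_names)
print(len(plan))  # 1 baseline + 9 sources * 3 algorithms = 28
```

Pairs are only formed within a dataset, since intra-dataset transfer requires the source and target to share the same instance set.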
In addition to the single-source setting described above, we also consider a subset of tasks to study in the multi-source setting, where multiple tasks are simultaneously used as source tasks to transfer to a single target task (Section 6.2). For our experiments, we select the two target tasks from each dataset that benefit the most from task transfer, and we use the three source tasks that transferred best onto those targets.
Results and Analysis

Single-Source Setting
Table 2 shows the results for all three models and algorithms, and we use this table to understand general trends. Figure 3 shows the relative improvement of a source task for each target task, demonstrating trends across tasks.
Aggregate Performance We find that, on average, Friends tasks score 7-8 points lower than DailyDialog tasks, likely due to the greater number of speakers and utterance length of Friends. We find that GPT-2 lags behind the raw scores of BERT and T5 by ∼10 points. This is expected as autoregressive decoder models are not designed with classification in mind. We find that the largest average ∆ is 1.4, leaving room for improvement in task transfer on FETA.
Furthermore, we are interested in knowing how much we would gain by using the best source task vs. a random source task. We calculate the differences between average ∆ and top-1 ∆ and find the mean difference to be ∼1.6 and the largest difference to be ∼3.5, motivating a further understanding of which source tasks transfer best to target tasks.
Performance Across Learning Algorithms We average scores across both datasets and find that pre-train/fine-tune gets an average score of 42.85, multitask 42.84, and multitask/fine-tune 44.07. Table 2 shows that multitask/fine-tune achieves the best average score for all models and datasets, and indeed its average score is a 2.8% improvement over the other algorithms. However, aggregate scores obscure some interesting nuances.
Do Trends Vary Across Models? Previous studies on task transfer have focused on a single model (Pruksachatkun et al., 2020; Lourie et al., 2021; Aribandi et al., 2022), but we find that trends vary depending on the model. For example, we find results similar to Lourie et al. (2021), namely, that fine-tuning on the target task always benefits the T5 model. However, we discover that this does not hold for BERT and GPT-2, which achieve better scores from multitasking than pre-train/fine-tune.
Furthermore, Figure 3 shows that trends on individual tasks also vary depending on the model. For example, T5 positively transferred knowledge to question answering with all learning algorithms and from most source tasks, while GPT-2 had a negative transfer from all algorithms and sources.
For nearly all dimensions of analysis (e.g., sample sizes, learning algorithm), we find different trends between models. We strongly suggest that future research be performed on multiple models before attempting to draw broad conclusions on transfer learning.
Multitask/Fine-tune As Regularization We find that T5's top-1 score and ∆ on DailyDialog are highest for pre-train/fine-tune, but the average score and ∆ are highest for multitask/fine-tune. To understand why this occurred, we find the bottom-1 scores for T5 on DailyDialog: 46.78, 46.69, and 48.26 for the pre-train/fine-tune, multitask, and multitask/fine-tune algorithms, respectively, confirming that multitask/fine-tune does achieve the best worst-case performance. Moreover, we find that for all datasets and models, multitask/fine-tune achieves the best worst-case performance. In fact, for GPT-2 on Friends, utilizing the bottom-1 source tasks still leads to a 0.74% improvement over the baseline.
Do All Task Types Benefit Equally? We find that span extraction tasks gain the most as target tasks, shown in Figure 4 to benefit at all source-to-target sample ratios. Multiple choice tasks also stand to gain from task transfer, but we find that this only occurs at a 10:1 ratio of source-target samples. This gain is likely due to the high-level language understanding required by both tasks.
Additionally, we find that utterance-level classification tasks decrease in score ∆ at increasing source-to-target sample ratios. This is possibly due to models overfitting to specific tasks and a catastrophic forgetting of general skills learned during their large-scale pre-training.
Do All Task Types Give Equal Benefit? We find that multiple-choice tasks give the greatest benefit as source tasks, especially when the ratio of source-to-target samples is low, as shown in Figure 9 in the Appendix. Additionally, we find that at a ratio of 10:1 source-target samples, dialogue-level classification benefits downstream tasks, but utterance-level classification requires a ratio of 100:1.
How Do Sample Sizes Affect Transfer? Figure 5 shows that, interestingly, GPT-2 and T5 have opposite trends in relation to sample size. We find that ∆s for GPT-2 increase with high target samples and decrease with high source samples. This suggests that GPT-2 may be overfitting to the source task and performs better with resource-rich target tasks. We find that T5 ∆s decrease as target-task samples increase, suggesting that T5 is more sample efficient than both GPT-2 and BERT.

Multi-Source Setting
For multi-source transfer we select the two target tasks from each dataset with the best score differences from the single-source setting, shown in Figures 7 and 8 in the Appendix. We find those four tasks to be Dialogue Reasoning Span Extraction (DRSE), Dialogue-Level NLI (DNLI), Character Identification (CI), and Question Answering (QA). For each of these target tasks, we select the top-3 best source tasks, shown in Table 6 of the Appendix. Learning in this setting is similar to single-source, except we now simultaneously optimize the loss for multiple source tasks. Table 3 shows the multi-source results compared with the average score of the top-3 source tasks from the single-source setting. Full results, including score ∆s from the single-source baselines, average top-3 score ∆s, and multi-source score ∆s are in Table 6 of the Appendix.

Does Multi-source Improve Over Single-source?
We expected that by utilizing the top-3 source tasks from the single-source setting, the multi-source setting would improve performance for all models and algorithms, but we find results to the contrary. We find that 6/9 multi-source algorithms outperform their average top-3 single-source counterparts in DRSE, 6/9 for DNLI, 3/9 for CI, and only 2/9 for QA, showing that naively combining source tasks is not always beneficial. The impressive result for DRSE follows our original intuition, given that there is an almost unanimous benefit from all source tasks, shown in Figure 3. Similarly, we find that multi-source performance on CI also correlates with the performance of individual source tasks. We find that in the single-source setting GPT-2 is the only model that improves with any source task, and indeed GPT-2 sees benefits from multi-source training on all algorithms.
Which Models Benefit From Multi-Source? Table 6 shows that GPT-2 improves in 8/12 experiments over its average top-3 single-source counterparts, but BERT in only 5/12 and T5 in only 4/12 experiments. It is counter-intuitive that T5 should perform the worst as we expect that it has a higher capacity for learning due to twice the model size. On the other hand, the additional parameters may be causing T5 to overfit on training data in the few-sample setting.
Table 3: Multi-source score ∆s from the average score of the top-3 source tasks. Full results, including score ∆s from the fine-tuned baseline, are in Table 6.

Conclusion
We introduce FETA, a comprehensive benchmark for evaluating language models and task transfer learning algorithms in open-domain dialogue with few samples. Through extensive experimentation, we find new and non-intuitive insights on the mechanisms of transfer learning. In particular, we find that most trends are model-specific, and we strongly encourage researchers to consider multiple model architectures before attempting to draw broad conclusions on transfer learning. It is our hope that FETA enables further research not only in task transfer, but also in other learning settings, and in the generalizability and efficiency of model architectures and pre-training datasets.

Limitations
A concern regarding any work that includes large-scale experiments with large language models is the energy consumption and environmental impact, the current work included. While there is a cost to running these experiments, the goal of this work is to improve sample efficiency in the future and we hope that the benefits in future energy saved will outweigh the up-front costs of discovering efficient methods.
Another concern of a large-scale benchmark is that of accessibility. A benchmark requiring too many resources will limit those who can reasonably compete. For this reason and others, in addition to our large-scale benchmark we also include a smaller multi-source setting which requires only 4 experiments to be run for a single model and algorithm, rather than 132 in the single-source setting. We believe this smaller setting will maintain the ability to extract high-quality insights on task transfer, yet allow for increased community access and reduce the carbon footprint of this benchmark.
While we do control for domain adaptation in our experiments on task transfer, there are some aspects that we cannot control. For example, each model has done language model pre-training with a different corpus. BERT was trained on English Wikipedia and BookCorpus (Zhu et al., 2015), GPT-2 was trained on WebText (Radford et al., 2019), and T5 was trained on C4 (Raffel et al., 2020). This difference likely affects model performance on the dialogue tasks in FETA.
Additionally, we cannot exhaustively test every language model, but still try to provide enough variety in order to draw broad conclusions on task transfer. For example, we don't run any experiments on language models pre-trained in the dialogue domain or language models larger than base-sized. We expect that both of these changes would improve raw performance on FETA. More importantly though, it is unclear whether either of these changes would lead to improved task-transfer performance (average and top-1 ∆s) and we leave this exploration for future work.
Furthermore, we cannot exhaustively test all learning algorithms. For example, Wang et al. (2020) propose a transfer learning method that minimizes negative task interference via meta-learning for multilingual models, Albalak et al. (2022) propose a policy-guided algorithm for task transfer in low-data settings, and Yu et al. (2020b) propose an optimization algorithm that mitigates gradient interference for reinforcement learning agents.
Finally, we stress the importance of intra-dataset task transfer in this work. However, this limits the number of pre-annotated tasks that are available, and there are certainly some tasks which we were not able to accommodate in FETA.
was supported by the National Science Foundation award #2048122. The views expressed are those of the author and do not reflect the official policy or position of the US government. Finally, we thank the Robert N. Noyce Trust for their generous gift to the University of California via the Noyce Initiative.

RECCON Poria et al. (2021) introduce the task of recognizing emotion causes in conversation and provide annotations for two subtasks: causal emotion span extraction and causal emotion entailment. Recognizing the cause behind emotions is an important aspect of developing conversational agents that can respond appropriately and these tasks test that ability. Both tasks assume that the emotion of an utterance is already known and require a model to identify the evidence or cause of the given emotion. In causal emotion span extraction, the model is given input as "The target utterance is <U_t>. The evidence utterance is <U_e>. What is the causal span from evidence in the context that is relevant to the target utterance's emotion <E_t>?". On the other hand, if the conversation history up to utterance U_t is H(U_t), then the task of causal emotion entailment is to classify the triple (U_t, U_e, H(U_t)) as entailment or not entailment. In this case, entailment means that the emotion expressed in the target utterance, U_t, is caused by the evidence utterance, U_e.
CIDER Ghosal et al. (2021) provide annotations for four tasks designed to explore commonsense inference and reasoning in dialogue: dialogue-level natural language inference (DNLI), dialogue reasoning span extraction, dialogue reasoning multiple choice, and commonsense relation extraction. These tasks are created by annotating knowledge triplets on 31 relations that are either explicitly stated in the dialogue or that require commonsense reasoning using contextual information. In DNLI, the task is to determine whether a triplet is true or false given the dialogue. Given a knowledge triplet as <head, relation, tail>, the span extraction task is formulated as identifying the tail when given the head, relation, and dialogue for context. The multiple choice task is motivated by the SWAG commonsense inference task (Zellers et al., 2018): given a head, relation, and conversation as context, the goal is to predict the tail of the relation from 4 possible choices. Finally, commonsense relation extraction is formulated as a standard relation extraction task: given the head, tail, and conversation as context, the goal is to predict the correct relation out of 31 options.
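To make the four CIDER task formulations concrete, the sketch below derives one instance of each task from a single knowledge triplet. This is an illustration under assumptions: the field names, the placeholder relation names (`rel_0` to `rel_30`), the toy dialogue, and the distractor tails are all hypothetical and do not reflect the actual CIDER or FETA data format.

```python
def cider_instances(dialogue, head, relation, tail, distractor_tails, all_relations):
    """Build one illustrative instance per task from a true <head, relation, tail> triplet."""
    return {
        # DNLI: is the triplet true given the dialogue? (A false example would
        # need a corrupted triplet, which is not shown here.)
        "dnli": {"context": dialogue, "triplet": (head, relation, tail), "label": True},
        # Span extraction: recover the tail from the head and relation.
        "span_extraction": {
            "context": dialogue, "head": head, "relation": relation, "answer": tail,
        },
        # Multiple choice: pick the tail out of 4 candidates.
        "multiple_choice": {
            "context": dialogue, "head": head, "relation": relation,
            "choices": [tail] + list(distractor_tails[:3]), "answer_index": 0,
        },
        # Relation extraction: pick the relation out of all candidate relations.
        "relation_extraction": {
            "context": dialogue, "head": head, "tail": tail,
            "candidates": list(all_relations), "label": relation,
        },
    }


relations = [f"rel_{i}" for i in range(31)]  # placeholder names for the 31 relations
instances = cider_instances(
    dialogue="A: It rained all night. B: No wonder the streets are wet.",
    head="rain",
    relation="rel_3",
    tail="wet streets",
    distractor_tails=["dry streets", "a sunny day", "heavy traffic"],
    all_relations=relations,
)
print(len(instances["multiple_choice"]["choices"]))  # 4
```

All four instances share the same dialogue context, which is what allows the tasks to be studied together in the intra-dataset setting.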
DailyDialog++ Sai et al. (2020) present the DailyDialog++ dataset, where they aim to improve evaluation of response generation. They do so by collecting five relevant responses and five adversarially crafted irrelevant responses for each dialogue in their dataset, and we recycle their data for a new task called adversarial response selection. Adversarial response selection is formulated as a multiple choice selection between a correct response, a randomly selected negative response, and an adversarial negative response.

A.2 Friends
EmoryNLP Chen and Choi (2016) and Zhou and Choi (2018) provide annotations for character identification, a subtask of entity linking, where entity mentions in an utterance need to be matched to their correct entity. For this task there are seven possible entities: the six main characters and an "other" entity.
Zahiri and Choi (2018) provide annotations on emotion recognition, with the 7 fine-grained emotions from the Feeling Wheel (Wilcox, 1982).
Ma et al. (2018) present annotations for a subtask of reading comprehension, called passage completion. In passage completion, given a dialogue and a factual statement about the dialogue where character mentions are removed, the task is to fill in the blanks with the correct character from the dialogue. This task is similar to a multiple choice task because entity choices are presented to the model, but because there are a varying number of options in each dialogue, it is formulated as a span extraction task that is evaluated based on accuracy.
Yang and Choi (2019) introduce annotations for question answering. The answers to question-answer pairs can either be a speaker name or exist as a span within the dialogue, and multiple spans may be correct.
Jiang et al. (2020) present the personality detection task by annotating speakers with five traits: agreeableness, conscientiousness, extraversion, openness, and neuroticism. The goal of the task is to correctly identify whether a given character from a dialogue either has or does not have each of the five traits.

B Implementation Details
For our experiments, we use the pretrained model implementations from the HuggingFace Transformers library (Wolf et al., 2020), where the bert-base-uncased model has 110M parameters, GPT-2 has 124M parameters, and T5-base has 223M parameters. We use the Adam optimizer (Kingma and Ba, 2015) with a batch size of 60 and run a learning rate sweep across $\{3 \times 10^{-6}, 1 \times 10^{-5}, 3 \times 10^{-5}, 1 \times 10^{-4}\}$ during the pre-training phase, finding that $3 \times 10^{-5}$ worked well across all models. In all experiments we utilize validation-based best model selection, and train models for 30 epochs on DailyDialog tasks and 20 epochs on Friends tasks.
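The learning rate sweep with validation-based selection can be sketched in a few lines. The grid, batch size, and epoch counts below are the values stated above; everything else is an assumption for illustration, in particular the `train_and_validate` callable (a stand-in for a full training run) and the made-up validation curve, which is not a reported result.

```python
import math

LEARNING_RATES = (3e-6, 1e-5, 3e-5, 1e-4)
BATCH_SIZE = 60
EPOCHS = {"DailyDialog": 30, "Friends": 20}


def sweep(train_and_validate, learning_rates=LEARNING_RATES):
    """Run one (hypothetical) training per learning rate; keep the best validation score."""
    results = {lr: train_and_validate(lr) for lr in learning_rates}
    best_lr = max(results, key=results.get)
    return best_lr, results


def toy_validation_score(lr):
    """Made-up validation curve peaking at 3e-5, used only to exercise the sweep."""
    return 50.0 - abs(math.log10(lr) - math.log10(3e-5))


best_lr, results = sweep(toy_validation_score)
print(best_lr)  # 3e-05 on the toy curve
```

In a real run, `train_and_validate` would train for the configured number of epochs and return the best validation score seen across epochs, so that model selection and the sweep share the same criterion.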

C Expanded Single-Source Results

Figure 1: Task Transfer Performance on FETA-DailyDialog. Computed transfer performance is demonstrated by arrows leaving from source tasks and entering target tasks. Strength of the transfer is denoted by thickness and color of edges.

Figure 3: Relative improvement of transfer over fine-tuned baselines. Rows are source tasks and columns are target tasks. Diagonal cells are baseline scores. Looking at an individual column can demonstrate best source tasks for that target. Looking at rows can determine which source task works well across multiple targets.

Figure 4: Score ∆ by target task type. Lines show the average score ∆ when the target task is of the specified task type, computed as a best-fit linear interpolation of the data with a 95% confidence interval. The number of samples for an individual task is fixed, but source/target ratios vary depending on which task pair is used.

Figure 5: Score ∆ by sample count. Sample count is on the x-axis (log scale) and score ∆ is on the y-axis. The blue dotted line represents the average transfer ∆ from a source task to all target tasks. The brown line represents the average transfer ∆ to a target task from all sources. Trend lines are a linear best-fit on the data with a 95% confidence interval. The number of samples for an individual task is fixed, but source/target ratios vary depending on which task pair is used.

Figure 6: Utterance and dialogue length distributions in FETA.

DialogRE Yu et al. (2020a) introduce a relation extraction dataset annotated with 36 different relations. Their dataset anonymizes speakers which allows for an entity linking relation called "per:alternative_name". However, our version of the Friends dataset is named and so we remove this relation from our data. This task is similar to the relation extraction from DailyDialog; however, the relations in DailyDialog are commonsense relations, and the relations in Friends are focused on information about entities.
MELD Poria et al. (2019) provide additional annotations for emotion recognition, with only 22.2% dialogue overlap with Zahiri and Choi (2018)'s dialogues. Additionally, while both use 7 total emotions, Poria et al. (2019) use 2 different emotions from Zahiri and Choi (2018).

Figure 8: Aggregate task transfer performance on Friends.

Figure 9: Score ∆ by source task type. The number of samples for an individual task is fixed, but source/target ratios vary depending on which task pair is used.

Table 1: Overview of FETA tasks. Task types are abbreviated as follows: Utt Cls for utterance-level classification, Dial Cls for dialogue-level classification, Span Ex for span extraction, and Mult Ch for multiple choice.

Table 2: Average and Top-1 Source task transfer scores. Average scores and ∆s aggregate scores over all source tasks, compared with Top-1 scores and ∆s which are calculated with scores from the highest performing source task. ∆s are the difference from the baseline score without task transfer. Highest values for each model are underlined, highest values across all models are bolded.
Fine-tune 51.40 (0.25) -0.15 52.76 +1.22 44.69 (0.28) +1.41 46.00 +2.72

Table 4: Prompts for FETA-DailyDialog tasks. All prompts start with "context: <context>", but we leave this out due to repetitiveness and space.

Table 5: Prompts for FETA-Friends tasks. All prompts start with "context: <context>", but we leave this out due to repetitiveness and space.