Analyzing Multi-Task Learning for Abstractive Text Summarization

Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage to learn common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies using task families for the English abstractive text summarization task. We train models using three multi-task learning strategies, i.e., sequential, simultaneous, and continual multi-task learning, and evaluate the trained models on two downstream tasks. We find that certain combinations of task families (e.g., advanced reading comprehension and natural language inference) positively impact downstream performance. Further, we find that the choice and combination of task families influence downstream performance more than the training scheme, supporting the use of task families for abstractive text summarization.


Introduction
Self-supervised learning has been a significant success driver for generating high-quality abstractive summaries (Devlin et al., 2019; Liu et al., 2019b; Cohen and Gokaslan, 2020; Lewis et al., 2020; Raffel et al., 2020; Radford et al., 2019). Through self-supervision, language models implicitly learn intrinsic language features (e.g., syntax) from unlabeled data that they can use to solve downstream tasks (Brown et al., 2020). However, skills necessary to perform specific tasks can often be learned from an existing set of labeled data, requiring fewer training iterations (Rajpurkar et al., 2016; See et al., 2017). For example, to perform text summarization, a helpful skill is the ability to answer questions about texts (Rajpurkar et al., 2016). Our code is available at https://github.com/FKIRSTE/GEM_emnlp2022-TOASTS.
The multi-task learning paradigm and its variations aim to acquire multiple skills simultaneously to succeed on downstream tasks, e.g., T5 (Raffel et al., 2020), and are independent of a specific training stage (Aribandi et al., 2021). While large-scale studies on the effects of multi-task learning exist (Aghajanyan et al., 2021; Sun et al., 2020; Aribandi et al., 2021) and are evaluated on broad natural language understanding benchmarks (Wang et al., 2019), they lack insight into the influence on abstractive text summarization. Furthermore, multi-task learning approaches are diverse in their methods (e.g., training scheme, mixing strategy, task families), hampering their comparison.
In this work, we investigate the role of multi-task learning in English abstractive text summarization. To this end, we organize 18 pre-selected training tasks into six higher-level, modular task families. Further, we compare three training schemes for the pre-finetuning stage and their respective mixing strategies through changes in multiple evaluation metrics.
Our experiments show that the choice of task families significantly impacts text summarization, while different training schemes have little influence. Moreover, pairing a text summarization task family with any other family helps to stabilize the overall performance when transferring to unknown data. In some cases, we also find that a text summarization task family can be substituted by other family pairs, e.g., advanced reading comprehension and classification.
To summarize our contributions:
• We study the influence of multi-task learning by training models on six task families for the English abstractive text summarization task.
• We compare the influence of three training schemes (i.e., sequential, simultaneous, continual multi-task learning) and two mixing strategies (i.e., proportional, equal).

Related Work
Multi-task learning and pre-finetuning. Transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) are trained using a two-step approach: pre-training on large unlabeled corpora and finetuning on a smaller, more specific (and usually labeled) downstream corpus. This two-step approach allows language models to obtain general text representations once and then perform many NLP downstream tasks with few gradient steps (e.g., document classification (Ostendorff et al., 2020a,b), plagiarism detection (Wahle et al., 2021, 2022b,c), media bias detection (Spinde et al., 2021, 2022)). However, pre-training is typically highly computationally expensive and requires ample dedicated infrastructure; few researchers can reproduce the pre-training of large language models. Therefore, recent works (Phang et al., 2018; Aghajanyan et al., 2021) proposed additional training stages between pre-training and finetuning, i.e., pre-finetuning.
ERNIE 2.0 (Sun et al., 2020) proposes continual multi-task learning, in which tasks are trained incrementally, building a queue of introduced tasks that re-appear throughout the training process to counter catastrophic forgetting (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017). MUPPET (Aghajanyan et al., 2021) and ExT5 (Aribandi et al., 2021) follow a simultaneous approach, drawing heterogeneous batches from multiple tasks and massively scaling their training to >50 and >100 tasks, respectively. MT-DNN (Liu et al., 2019a) organizes the prediction layer of a Transformer into four task families of common tasks of the GLUE benchmark (Wang et al., 2018) and learns each task sequentially with the task order randomized. This study compares continual multi-task learning, simultaneous training, and sequential training for abstractive text summarization.
Task selection and relationship. Vu et al. (2020) conduct an empirical investigation of 33 tasks across three broad groups (i.e., text classification, question answering, and sequence labeling) to explore their inter- and intra-group training for different group sizes. Their experiments suggest that positive transfer between task groups is possible when the source dataset is small, and that inter-group transfers are sensitive to group sizes. ExT5 (Aribandi et al., 2021) analyzes the correlation of task family representatives and shows that summarization tasks (i.e., CNN/Daily Mail (See et al., 2017), XSum (Narayan et al., 2018), WikiLingua (Ladhak et al., 2020)) generally reduce performance on most other task families and that CBQA tasks (i.e., Natural Questions (Kwiatkowski et al., 2019), Trivia QA (Joshi et al., 2017), Hotpot QA (Yang et al., 2018)) are sensitive to multi-task learning. For the task relationship and transfer analysis, Aribandi et al. (2021) train on two families simultaneously and evaluate the first one. We expand the study of Aribandi et al. (2021) by adapting task families and respective representative tasks to be related to the text summarization task (Section 3.1), considering different family combinations and training approaches (Section 3.2), and tracking their performance through additional metrics on different unseen datasets (Section 4).
Multiple works leverage algorithms for the selection of training tasks; e.g., Ruder and Plank (2017) use Bayesian Optimization to learn similarity measures (i.e., Jensen-Shannon divergence (Lin, 1991) and Rényi divergence (Rényi, 1961)), and AutoSem (Guo et al., 2019) uses a Beta-Bernoulli multi-armed bandit with Thompson Sampling (Russo et al., 2018; Thompson, 1933). Conversely, ExT5 (Aribandi et al., 2021) does not rely on automatic task selection approaches as described by the preceding works and instead chooses an empirical approach to select tasks for higher-level task families. We follow Aribandi et al. (2021)'s approach to selecting task representatives when choosing our tasks, as the training task correlation analysis in ExT5 indicates which families could positively influence text summarization.

Methodology
We name our study TOASTS, a Task-Oriented AnalysiS for Text Summarization, to investigate the effects of different task family combinations on English abstractive text summarization via a multi-task learning architecture. TOASTS groups selected pre-training tasks into task families and explores the correlation of these families and their influence during the pre-finetuning stage. Pre-finetuning has two main parts: the task family setup and the training strategies.
The task family setup groups different tasks and related datasets into broader families according to their primary objective. The tasks of these families are then combined following a training strategy and evaluated on a final task. Figure 1 illustrates the components of TOASTS, which are detailed in the following sections.

Task family setup
Selection. A myriad of NLP downstream tasks (e.g., word sense disambiguation and paraphrase detection) can be considered when choosing a multi-task architecture. Without computational limits, one could explore all possible permutations of tasks and the influence of the respective tasks on downstream performance. Unfortunately, the number of task combinations grows factorially with the number of tasks, making joint training of all of them computationally prohibitive (Aribandi et al., 2021). Therefore, we organize tasks into six high-level families (Aribandi et al., 2021; Brown et al., 2020) and perform combinations at the family level: classification (CLS), commonsense reasoning (CMNS), natural language inference (NLI), reading comprehension (RC), advanced reading comprehension (RC+), and summarization (SUM). We compose each task family of three datasets that tackle different aspects of the problem, as shown in Table 1.
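For illustration, the family setup amounts to a plain mapping from family identifiers to their three representative datasets; the sketch below uses placeholder dataset names rather than the exact selection from Table 1.

```python
# A minimal sketch of the task family setup; the dataset identifiers are
# illustrative placeholders, not the actual selection listed in Table 1.
TASK_FAMILIES = {
    "CLS":  ["cls_data_1", "cls_data_2", "cls_data_3"],
    "CMNS": ["cmns_data_1", "cmns_data_2", "cmns_data_3"],
    "NLI":  ["nli_data_1", "nli_data_2", "nli_data_3"],
    "RC":   ["rc_data_1", "rc_data_2", "rc_data_3"],
    "RC+":  ["rcp_data_1", "rcp_data_2", "rcp_data_3"],
    "SUM":  ["sum_data_1", "sum_data_2", "sum_data_3"],
}
```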
The selected tasks in TOASTS should not be seen as an exhaustive list of all NLP downstream tasks; instead, they should be considered an educated selection to measure task family influence on text summarization. An extended list of planned tasks for future analyses can be found in Table 7 in Appendix A.
Task mixing. After pre-selecting representative tasks for each family, we control the percentage of data ingested from each task using a task mixing strategy. We consider two methods for processing all combinations of task families: proportional mixing (Sanh et al., 2019; Aribandi et al., 2021) and equal mixing (Raffel et al., 2020). Equal mixing picks training samples from each task with equal probability, while proportional mixing sets the probability to the proportion of each task's size. Proportional mixing is the recommended default strategy for various multi-task learning setups (Sanh et al., 2019). However, continual multi-task learning (Section 3.2) requires an equal mixing strategy, even though related studies have shown it to be sub-optimal (Raffel et al., 2020). While we sample either proportionally or equally within task families, we draw equally between task families to balance the influence of potentially different task families. We leave the investigation of the effects of different numbers of tasks and samples per family to future work.
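As a concrete sketch (under the assumption that per-task dataset sizes are known; the function below is our own illustration, not code from TOASTS), the two intra-family strategies reduce to the following sampling probabilities.

```python
def mixing_weights(sizes, strategy="proportional"):
    """Per-task sampling probabilities within one task family.

    sizes: number of training samples per task in the family.
    strategy: "proportional" weights each task by its size;
              "equal" gives every task the same probability.
    """
    if strategy == "equal":
        return [1.0 / len(sizes)] * len(sizes)
    total = sum(sizes)
    return [size / total for size in sizes]

# Example: three tasks of one family with 100k, 50k, and 10k samples.
print(mixing_weights([100_000, 50_000, 10_000]))           # [0.625, 0.3125, 0.0625]
print(mixing_weights([100_000, 50_000, 10_000], "equal"))  # [0.333..., 0.333..., 0.333...]
```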

Training strategies
Training Schemes. Multi-task learning during a pre-finetuning stage allows us to start from a pre-trained checkpoint, decreasing the final task's overall cost. We explore three multi-task learning training schemes for pre-finetuning, as Figure 2 shows: sequential learning (seq) (McCloskey and Cohen, 1989; Biesialska et al., 2020), simultaneous learning (sim) (Caruana, 1997; Aghajanyan et al., 2021), and continual multi-task learning (cMTL) (Sun et al., 2020). In the sequential approach, training batches are composed of a single dataset, i.e., homogeneous batches, and their processing order is sequentially randomized (Liu et al., 2019a). This approach achieves concentrated task learning at the batch level while keeping the overall variety, therefore learning a task more thoroughly before moving to the next. For the simultaneous strategy, we combine all tasks into a single pool and draw randomly from it (Aghajanyan et al., 2021; Aribandi et al., 2021). This prominent approach introduces task variety at the batch level by constantly challenging the model with different tasks, forcing it to identify intrinsic commonalities between the task families quickly. For continual multi-task learning, we adjust the concept of ERNIE 2.0 (Sun et al., 2020) to our task family configuration. As our task corpus is less extensive than the training dataset used in ERNIE 2.0, we have to adjust the number of stages and training steps in TOASTS. Therefore, when including new tasks and task families, we set their total number of steps to 9k and 27k, respectively, as Table 2 shows. One difference from ERNIE 2.0 is that once a new task is introduced to the pipeline and trained for the first time at timestep t, we move it to the end of the queue of previously trained tasks as the last one to be executed at t + 1. Using the order in Sun et al. (2020) as an alternative way of including and carrying new tasks yields worse results (Table 8). Through the pre-determined task order of this approach, we can control which task families follow each other and how fundamental a task is by introducing it earlier than others.
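The scheduling logic of the three schemes can be sketched as follows; this reflects our reading of the description above (homogeneous batches in random task order for seq, one shared pool for sim, a growing queue with the newest task appended last for cMTL), not the exact implementation.

```python
import random

def seq_schedule(tasks, stages):
    """Sequential (seq): one task per stage, homogeneous batches,
    with the task order randomized."""
    order = random.sample(tasks, len(tasks))
    return [[order[stage % len(order)]] for stage in range(stages)]

def sim_schedule(tasks, stages):
    """Simultaneous (sim): every stage draws from the full task pool."""
    return [list(tasks) for _ in range(stages)]

def cmtl_schedule(tasks, stages):
    """Continual multi-task learning (cMTL): each stage introduces one new
    task, which is appended to the end of the queue of trained tasks."""
    queue, schedule = [], []
    for stage in range(stages):
        if stage < len(tasks):
            queue.append(tasks[stage])  # newest task is executed last
        schedule.append(list(queue))
    return schedule

print(cmtl_schedule(["CLS", "NLI", "SUM"], 3))
# [['CLS'], ['CLS', 'NLI'], ['CLS', 'NLI', 'SUM']]
```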

Experimental setup
Model. For all experiments, we use BART-Large (Lewis et al., 2020) to probe combinations of task families, mixing, and training strategies in TOASTS. BART is a denoising autoencoder that corrupts its input text and reconstructs it through a sequence-to-sequence model. We chose BART because of its ability to perform a wide range of downstream tasks, such as paraphrase detection (Wahle et al., 2022b), fake news identification (Wahle et al., 2022a), and text summarization (Lewis et al., 2020). Additionally, in our preliminary experiments, BART also performed better than other candidate models such as PEGASUS (Zhang et al., 2020) and T5 (Raffel et al., 2020) (comparison in Tables 9 and 10 in Appendix B).
Tokenization. We tokenize text using the BART-Large tokenizer and augment all texts with task-specific prompts such as 'question:' or 'context:'. Further, we structure the samples to follow a uniform text-to-text style, which allows the model to handle multi-task learning across different task families without needing task-specific losses, loss scaling, or explicit gradient accumulation on heterogeneous batches (Liu et al., 2019a; Aghajanyan et al., 2021).
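A sketch of this sample construction, assuming the Hugging Face transformers tokenizer; the field names, prompt set, and sequence lengths are our own assumptions for illustration.

```python
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")

def to_text_to_text(sample):
    """Cast a QA-style sample into the uniform text-to-text format with
    task-specific prompts, so all families share one seq2seq objective."""
    source = f"question: {sample['question']} context: {sample['context']}"
    target = sample["answer"]
    model_inputs = tokenizer(source, truncation=True, max_length=1024)
    labels = tokenizer(text_target=target, truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```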
Hyperparameters. We run our experiments on 8 NVIDIA A100s with a total of 320GB of GPU memory. The models are trained with a total batch size of 8 for three epochs and up to 60k global steps for six task families during pre-finetuning (finetuning: 16k steps for Reddit TIFU, 70k for arXiv) with half-precision (fp16).

Downstream datasets. We finetune and evaluate on two abstractive text summarization datasets: Reddit TIFU and arXiv. The arXiv dataset consists of 250K scientific articles with the task of deriving the abstract from the full text. These datasets are commonly referred to as challenging abstractive summarization tasks (Zhang et al., 2020; He et al., 2020). In combination, they provide a balanced landscape: Reddit TIFU contains shorter examples, with an average of 432 words per post and 23 per summary, relying on simpler language, while arXiv contains longer examples, with 4,938 words per document and 220 per summary, constructed from elaborate text.
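The training setup described under Hyperparameters can be sketched with the Hugging Face Seq2SeqTrainingArguments; only the total batch size of 8, three epochs, fp16, and the 60k-step cap come from the text above, all remaining values are placeholders (the full list is in Table 41).

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="toasts-prefinetune",  # placeholder path
    per_device_train_batch_size=1,    # 8 GPUs x 1 = total batch size of 8
    num_train_epochs=3,
    max_steps=60_000,                 # pre-finetuning cap; note that a
                                      # positive max_steps takes precedence
                                      # over num_train_epochs in the Trainer
    fp16=True,
    predict_with_generate=True,
)
```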
During our experiments, we consider a combination of count-based and semantic metrics to assess the quality of produced summaries. We use BLEU (Papineni et al., 2002), ROUGE (1, 2, L) (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which favor precision, recall, and the harmonic mean, respectively. Even though these traditional metrics can work well for similarly worded summaries, they are limited when the wording changes but the semantic meaning remains the same (Bhandari et al., 2020; Huang et al., 2021). To assess semantic similarity better, we also include BERTScore (Zhang et al., 2019a), a similarity measure that greedily maximizes the cosine similarity between candidate and reference contextualized token embeddings via BERT (Devlin et al., 2019).
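A sketch of this metric suite using the Hugging Face evaluate library (assuming its standard metric names; the exact scoring configuration used here may differ):

```python
import evaluate

def score_summaries(predictions, references):
    """Compute the count-based and semantic metrics described above for
    lists of generated and reference summaries."""
    results = {}
    results["bleu"] = evaluate.load("bleu").compute(
        predictions=predictions, references=references)["bleu"]
    rouge = evaluate.load("rouge").compute(
        predictions=predictions, references=references)
    results.update({k: rouge[k] for k in ("rouge1", "rouge2", "rougeL")})
    results["meteor"] = evaluate.load("meteor").compute(
        predictions=predictions, references=references)["meteor"]
    bertscore = evaluate.load("bertscore").compute(
        predictions=predictions, references=references, lang="en")
    results["bertscore_f1"] = sum(bertscore["f1"]) / len(bertscore["f1"])
    return results
```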

Experimental results and discussion
We structure our experiments into four research questions, which tackle the relevance of task families and dataset compatibility (RQ1), the effects of co-training text summarization task families with other families (RQ2), the co-training of task families excluding text summarization (RQ3), and the co-training of text summarization and two different task families (RQ4).
We pre-finetune our baseline model (BART-Large) for each experiment on specific task families (e.g., CLS, CMNS) and evaluate the resulting models on the Reddit TIFU and arXiv datasets. Tables 3 to 6 show the different task mixing and training strategies. Sequential (seq) and simultaneous (sim) training strategies use proportional mixing, while continual multi-task learning (cMTL) uses equal mixing. Because of space constraints, we report our results only for the METEOR metric, which proved to be the most sensitive in our experiments. We include a complete list of results for BERTScore, BLEU, METEOR, and ROUGE (1, 2, L) in Appendices C.1 and C.2.
RQ1: Does increasing the number of pre-finetuning datasets increase downstream task performance for text summarization?
A. To identify whether the text summarization downstream task benefits from unconstrained usage of multiple task families, we compare how each task family performs against the combination of all of them.
As Table 3 shows, the SUM task family consistently outperforms the combination of all families for both datasets (followed by RC), except for the sim training scheme on arXiv. The performance increase from pre-finetuning on SUM is somewhat expected, as it is the task family most related to the actual problem, i.e., abstractive text summarization. Conversely, NLI performs the worst compared to any other task family. Pre-finetuning generally affects BART positively compared to its baseline, except for a few cases (e.g., cMTL-RC+, NLI). Overall, the sim training strategy greatly influenced downstream task performance.
Our results suggest that combining all task families is suboptimal for text summarization, which challenges recent observations for other NLP tasks (Aghajanyan et al., 2021; Aribandi et al., 2021). Also, increasing the number of task families requires high compute budgets. As we train each task family individually or all simultaneously, it is unclear how much influence a summarization task family (e.g., SUM) has on the others.

RQ2: How much does the text summarization task affect other task families?
A. As SUM is closely related to the text summarization task and yields the best results in RQ1, we explore how its combination with another task family affects the resulting model. Table 4 shows the results of combining SUM with other task families. Aside from a few cases (e.g., arXiv sim for SUM+RC+), pairing with the SUM family improves over almost every single run in Table 3 and over the combination of all task families. While some task family combinations obtain small benefits (seq-SUM+RC), others are greatly affected (e.g., cMTL-SUM+CMNS) for both datasets. The BART baseline performs better than pre-finetuning in only two cases, i.e., SUM+CLS for Reddit TIFU (cMTL) and SUM+CMNS for arXiv (seq). We observe fewer outliers with low scores when pairing SUM with other task families than in RQ1. Individual training improved the performance on arXiv the most (seq and sim), while for Reddit TIFU, the combination of task families was more effective (seq and cMTL).
Low scores are also less frequent when combining task families, with one exception, i.e., cMTL-SUM+CLS for Reddit TIFU. The lowest scores in RQ1 (e.g., NLI, CMNS) and RQ2 (CLS) might be related to these tasks contributing little to the weights learned for the downstream task.
As Reddit TIFU uses mostly informal language and its input sequences and summaries are short, this might explain these low scores.
The improvements in Table 4 over the BART baseline are likely related to the SUM family rather than to a mixing strategy or training scheme.
Combining the SUM family with other task families yields results equal to or marginally higher than training it individually (RQ1) (e.g., 0.233 for SUM+RC vs. 0.231 for SUM). As the SUM family seems to substantially impact co-training with multiple tasks, we are interested in evaluating the influence of families other than SUM.

RQ3: How do non-text-summarization task families influence each other?
A. We remove the SUM family and co-train all possible pairs of task families (Table 5). We assume the stability provided by SUM would also be present when including more task families. Further, we observe the positive influence of RC and RC+ when pairing three task families excluding SUM (Tables 26 to 28).

Conclusion & Future Work
In this work, we studied the influence of multi-task learning combinations of task families during the pre-finetuning stage for English abstractive text summarization. We trained models with three different training strategies on six task families composed of 18 tasks and evaluated them on two downstream tasks.
Our experiments show that non-text-summarization task families, e.g., advanced reading comprehension, can substitute for the summarization task family (RQ2) or the combination of all task families (RQ1). However, including the summarization task family in the training process positively impacts downstream performance compared to non-text-summarization family combinations. Further, our analysis shows that training strategies have little influence on the overall performance compared to the task family selection.
We see this analysis as a first step towards understanding training strategies and task families for text summarization. In the future, we want to investigate more tasks (both in number and diversity) per task family, training schemes, and mixing strategies. We also plan to include psychological studies comparing the similarities of textual understanding tasks as a starting point for task family pre-selection.

Limitations
With the organization of tasks and datasets into task families, this study depends highly on the domain and expressiveness of these representative tasks.
As Aribandi et al. (2021) faced similar problems, we followed their guidance: we selected a diverse set of representative datasets to train and evaluate on, and we partitioned task families to be as mutually exclusive as possible while remaining related to abstractive text summarization. However, no dataset is perfectly isolated, and each can only serve as a proxy for a larger task family.

Ethical Considerations
This study depends on existing resources and generative models; thus, it is not free of biases and possible ethical concerns. One problem is the generation of text summaries that contain non-factual information, meaning distortion, social biases such as political stances, or abusive language (Gooding, 2022). To mitigate these problems, we plan to condition the generation of trained models to return an empty string for unsafe content or other harmful text.
Furthermore, TOASTS is licensed to the public under a copyright policy that allows unlimited reproduction, distribution, and hosting on any website or medium. Hence, anyone can exploit its limitations and inherited biases to propagate and amplify unintentional societal problems.

A Tasks and Families
Table 7 shows an extended version of the pre-finetuning tasks in Table 1 to be considered in future work.

B Additional Models
Tables 8 to 10 show the results for different models and loop orders. BART performed best compared to models from related work, which is why we chose it throughout our experiments.

C Extended Results
C.1 Extended Results on Reddit TIFU
Tables 11 to 28 show the detailed evaluation for each research question and all tested combinations of task families evaluated on the Reddit TIFU dataset. The tables are divided according to their training scheme, i.e., each table shows one of the three training schemes (sim, seq, cMTL).

C.2 Extended Results on arXiv
Tables 29 to 39 show the detailed evaluation for each research question and all tested combinations of task families evaluated on the arXiv dataset.
The tables are divided according to their training scheme, i.e., each table shows one of the three training schemes (sim, seq, cMTL).

D Hyperparameters
Table 41 shows the hyperparameters used throughout the pre-finetuning and finetuning experiments.

Figure 1 :
Figure 1: The central architecture of TOASTS. The intermediate training phase begins with the task family setup (left), organizing the pre-selected training tasks into families of similar problems and applying two intra-family mixing strategies (proportional, equal). The training strategies (right) continue by processing and organizing the generated task families into batches according to one of three training schemes (sequential, simultaneous, continual multi-task learning). After pre-finetuning BART, the resulting model is finetuned and evaluated on two abstractive text summarization datasets (Reddit TIFU, arXiv). The training/mixing scheme pairings are marked by green and blue background colors.

Figure 2 :
Figure 2: TOASTS's three training strategies. (a) Sequential learning (seq) draws a batch with samples from one task of a task family at a time for every training stage. The order of tasks is randomized. (b) Simultaneous learning (sim) samples from all available tasks at the same time. (c) Continual multi-task learning (cMTL) introduces a new task in each training stage, which is added to the end of the training queue.

Table 1 :
Our selection of 18 representative datasets organized by their task family.For every dataset, we list the target task, the source, and the characteristics of the data.For a complete list of tasks, please see Appendix A.

Table 2 :
The number of batches during cMTL training depends on the training stage and the number of introduced tasks.

Table 3 :
Results (METEOR) for single task families and the combination of all task families for the Reddit TIFU and arXiv datasets. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training. † Repeated result for baseline without training scheme.

Table 4 :
Results (METEOR) for the combination of SUM and different task families for the Reddit TIFU and arXiv datasets. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training. † Repeated result for baseline without training scheme.

Having combined SUM with single task families (RQ2), we investigate its influence in task family pairs (RQ4), as Table 6 shows. For this research question, we only consider Reddit TIFU, as it provides a more challenging scenario (i.e., informal, short texts) and limits our computational costs.

Table 5 :
Results (METEOR) for the combination of all pairs of task families (except for SUM) for the Reddit TIFU and arXiv datasets. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training. † Repeated result for baseline without training scheme.
Combinations with RC+ are still the top results when including SUM. As in Table 4, the SUM family seems to provide stability to the results, as we see fewer fluctuations than in Table 5.

Table 7 :
An extended list of Table 1. This list can be used to extend TOASTS to more tasks and datasets in future work. TF stands for Task Family.

Table 8 :
Results of different loop orders tested. Let t denote the current training stage; the ascending order for training stage t is Task_t, Task_1, Task_2, ..., Task_(t-1). The descending order for the same training stage t is Task_t, Task_(t-1), Task_(t-2), ..., Task_1.

Table 9 :
Results of different models used. The models were finetuned on Reddit TIFU without pre-finetuning and with full precision. Values in bold represent the highest results for a training scheme.

Table 10 :
Results of different models used. The models were finetuned on arXiv without pre-finetuning and with full precision. Values in bold represent the highest results for a training scheme.

Table 11 :
RQ1 results (single task family) for Reddit TIFU and the sequential strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 12 :
RQ1 results (single task family) for Reddit TIFU and the simultaneous strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 13 :
RQ1 results (single task family) for Reddit TIFU and the continual multi-task learning strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 14 :
RQ1 results (all task families) for Reddit TIFU and the sequential strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 15 :
RQ1 results (all task families) for Reddit TIFU and the simultaneous strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 16 :
RQ1 results (all task families) for Reddit TIFU and the continual multi-task learning strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 17 :
RQ2 results (pairing of the summarization task family with another task family) for Reddit TIFU and the sequential strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 24 :
RQ4 results (pairing of the summarization task family with two other task families) for Reddit TIFU and the simultaneous strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 25 :
RQ4 results (pairing of the summarization task family with two other task families) for Reddit TIFU and the continual multi-task learning strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 26 :
RQ4 results (pairing of three task families excluding the text summarization family) for Reddit TIFU and the sequential strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 27 :
RQ4 results (pairing of three task families excluding the text summarization family) for Reddit TIFU and the simultaneous strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 28 :
RQ4 results (pairing of three task families excluding the text summarization family) for Reddit TIFU and the continual multi-task learning strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 29 :
RQ1 results (single task family) for arXiv and the sequential strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 30 :
RQ1 results (single task family) for arXiv and the simultaneous strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.

Table 31 :
RQ1 results (single task family) for arXiv and the continual multi-task learning strategy. Values in bold represent the highest results for a training scheme. Underlined values are the highest results for that dataset independent of training.