Task Transfer and Domain Adaptation for Zero-Shot Question Answering

Pretrained language models have shown success in many areas of natural language processing, including reading comprehension. However, when machine learning methods are applied to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domain-specific downstream tasks. We evaluate zero-shot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labeled data from the target task. Our approach outperforms Domain-Adaptive Pretraining on downstream domain-specific reading comprehension tasks in 3 out of 4 domains.


Introduction
Pretrained language models (Liu et al., 2019; Wolf et al., 2020) require substantial quantities of labeled data to learn downstream tasks. For domains that are novel or where labeled data is in short supply, supervised learning methods may not be suitable (Zhang et al., 2020; Madasu and Rao, 2020; Rietzler et al., 2020). Collecting sufficient labeled data for each new application can be resource-intensive, especially when targeting both a specific task type and a specific data domain. With traditional transfer learning methods, it is prohibitively difficult to fine-tune a pretrained model on a domain-specific downstream task for which no training data exists. In light of this, we would like to use more readily available labeled in-domain data from unrelated tasks to domain-adapt our fine-tuned model.
In this paper, we consider a problem setting where we have a domain-specific target task (QA) for which we have no in-domain training data (QA data in the target domain). However, we assume that we have generic training data for the target task type, and in-domain training data for another task. To address this problem setting, we present Task and Domain Adaptive Pretraining (T+DAPT), a technique that combines domain adaptation and task adaptation to improve performance on downstream target tasks. We evaluate the effectiveness of T+DAPT in zero-shot domain-specific machine reading comprehension (MRC) (Hazen et al., 2019; Reddy et al., 2020; Wiese et al., 2017) by pretraining on in-domain NER data and fine-tuning for generic, domain-agnostic MRC on SQuAD v1.1 (Rajpurkar et al., 2016), combining knowledge from the two different tasks to achieve zero-shot learning on the target task. We test the language model's performance on domain-specific reading comprehension data from 4 domains: News, Movies, Biomedical, and COVID-19. In our experiments, RoBERTa-Base models trained with our approach perform favorably on domain-specific reading comprehension tasks compared to baseline RoBERTa-Base models trained on SQuAD, as well as to Domain-Adaptive Pretraining (DAPT). Our code is publicly available for reference. 1
We summarize our contributions as follows:
• We propose Task and Domain Adaptive Pretraining (T+DAPT), which combines domain adaptation and task adaptation to achieve zero-shot learning on domain-specific downstream tasks.
• We experimentally validate the performance of T+DAPT, showing that our approach performs favorably compared to both a previous approach (DAPT) and a baseline RoBERTa fine-tuning approach.
• We analyze adaptation performance on different domains, as well as the behavior of DAPT and T+DAPT under various experimental conditions.

Related Work
It has been shown that pretrained language models can be domain-adapted with further pretraining (Pruksachatkun et al., 2020) on unlabeled in-domain data to significantly improve performance on downstream supervised in-domain tasks. This was originally demonstrated by BioBERT (Lee et al., 2019). Gururangan et al. (2020) further explore this method of domain adaptation via unsupervised pretraining, referred to as Domain-Adaptive Pretraining (DAPT), and demonstrate its effectiveness across several domains and data availability settings. This procedure has been shown to improve performance on domain-specific reading comprehension tasks, particularly in the biomedical domain (Gu et al., 2021). In this paper, as a baseline for comparison, we evaluate the performance of DAPT-enhanced language models in their respective domains, both in isolation with SQuAD 1.1 fine-tuning and in conjunction with our approach that incorporates the respective domain's NER task. DAPT models for two of our domains, News and Biomedical, are initialized from pretrained weights provided by the authors of Gururangan et al. (2020); we train our own DAPT baselines for the Movies and COVID-19 domains. Xu et al. (2020) explore methods to reduce catastrophic forgetting during language model fine-tuning. They apply topic modeling to the MS MARCO dataset (Bajaj et al., 2018) to generate 6 narrow domain-specific datasets, from which we use BioQA and MoviesQA as domain-specific reading comprehension benchmarks.

Experiments
We aim to achieve zero-shot learning on an unseen domain-specific MRC task by fine-tuning on both a domain-transfer task and a generic MRC task. The model is initialized with pretrained RoBERTa weights (Liu et al., 2019), then fine-tuned on a domain-specific supervised task to acquire domain knowledge, and finally trained on SQuAD to learn generic MRC capabilities. This yields zero-shot MRC in the target domain: the model is evaluated on an unseen domain-specific MRC task without ever training on that task. This method is illustrated in Figure 1.
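The training sequence above can be summarized as an ordered list of stages. The sketch below is schematic only: the stage names are illustrative, and in practice each fine-tuning stage is a full training run that reuses the encoder weights produced by the previous stage.

```python
def tdapt_stages(domain: str):
    """Ordered training stages of T+DAPT for zero-shot MRC in `domain`."""
    return [
        ("initialize", "roberta-base pretrained weights"),
        ("fine-tune", f"{domain} NER"),             # supervised domain adaptation
        ("fine-tune", "SQuAD v1.1"),                # generic task adaptation
        ("evaluate", f"{domain} MRC (zero-shot)"),  # no target-task training data
    ]
```

The key property is that the target-domain MRC task appears only in the final evaluation stage, never as training data.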

Datasets
We explore the performance of this approach in the Movies, News, Biomedical, and COVID-19 domains. Specifically, our target domain-specific MRC tasks are MoviesQA (Xu et al., 2020), NewsQA (Trischler et al., 2017), BioQA (Xu et al., 2020), and CovidQA (Möller et al., 2020), respectively. We choose named entity recognition (NER) as our supervised domain adaptation task for all four target domains, as labeled NER data is widely available across various domains. Furthermore, NER and MRC share functional similarities, as both rely on identifying key tokens in a text as entities or answers. The domain-specific NER tasks are performed using supervised training data from each respective domain.
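As a toy illustration of this functional similarity (constructed from the MoviesQA sample in Table 3; the entity labels are our own annotation, not from any dataset), the gold answer to an MRC question is often exactly one of the text's named-entity spans:

```python
# Question: "who plays klaus baudelaire in the show"
context = "Liam Aiken played the role of Klaus Baudelaire in the 2004 movie."
entities = {"Liam Aiken": "PER", "Klaus Baudelaire": "PER", "2004": "DATE"}
gold_answer = "Liam Aiken"

# The MRC answer span coincides with an NER entity span, so NER
# fine-tuning can teach the model to attend to exactly these tokens.
assert gold_answer in entities and entities[gold_answer] == "PER"
```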

Methods
We compare our approach (T+DAPT) to a previous approach (DAPT) as well as to a baseline model. For the baseline, the pretrained RoBERTa-Base model is fine-tuned on SQuAD and evaluated on domain-specific MRC without any domain adaptation. In the DAPT approach, RoBERTa-Base is first initialized with DAPT weights, which have been pretrained on unsupervised in-domain text corpora for domain adaptation; these weights (NewsRoBERTa and BioRoBERTa) are either obtained from the HuggingFace model hub (Wolf et al., 2020) as provided by Gururangan et al. (2020), or trained ourselves following the methodology described in that work. The DAPT models are then fine-tuned on SQuAD and evaluated on domain-specific MRC.

Results
We compare the effectiveness of our approach, which uses NER instead of language modeling (as in DAPT) as the domain adaptation method in a sequential training regime. Our experiments cover every combination of domain (Movies, News, Biomedical, or COVID) and domain adaptation method: T+DAPT, which uses named entity recognition; DAPT, which uses language modeling; and the baseline, with no domain adaptation at all.
2 https://github.com/tsantosh7/COVID-19-Named-Entity-Recognition
3 https://github.com/davidcampos/covid19-corpus
Our results are presented in Table 2. We use F1 score to evaluate the QA performance of each model in its target domain. In our experiments, DAPT performs competitively with the baseline models and outperforms them in one domain (CovidQA). Our T+DAPT approach (RoBERTa + domain NER + SQuAD) outperforms the baseline in three out of four domains (Movies, Biomedical, COVID) and outperforms DAPT in three out of four domains (Movies, News, Biomedical). We also test a combination of DAPT and T+DAPT by retraining DAPT models on domain NER and then SQuAD, and find that this combined approach underperforms both T+DAPT alone and DAPT alone in all four domains. We further discuss possible reasons for these results in Section 4.
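The F1 score used here is the standard SQuAD token-overlap metric. A simplified sketch follows; the official evaluation script additionally strips punctuation and articles before comparing tokens.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer
    (simplified: the official script also normalizes punctuation/articles)."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0, while a verbose prediction that contains the gold span is penalized on precision, e.g. `token_f1("victor stone cyborg", "victor stone")` yields 0.8.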

Analysis
Specific domains learn from adaptation: Our approach shows promising performance gains when used for zero-shot domain-specific question answering, particularly in the Biomedical, Movies, and COVID domains, where the MRC datasets were designed with the evaluation of domain-specific features in mind. Performance gains are less apparent in the News domain, where the NewsQA dataset was designed primarily to evaluate causal reasoning and inference abilities, which correlate strongly with SQuAD and the baseline model's capabilities (Table 4).
When does DAPT succeed or fail: In zero-shot QA, DAPT performs competitively with the baseline in all domains and outperforms it in the COVID domain. This builds upon the results of Gururangan et al. (2020), which report superior performance on tasks like relation classification, sentiment analysis, and topic modeling, but do not address reading comprehension tasks, which DAPT may not have originally been optimized for. Unsupervised language modeling may not provide readily transferable features for reading comprehension, as opposed to NER, which identifies key tokens and classifies them into specific entity types. These entities are often also answer tokens in reading comprehension, lending transferable representations between NER and reading comprehension. Another possible factor is that RoBERTa was pretrained on the English Wikipedia corpus, the same source from which the SQuAD questions were drawn. Because of this, it is possible that pretrained RoBERTa already has relevant representations that provide an intrinsic advantage for SQuAD-style reading comprehension, and that these are lost to catastrophic forgetting after retraining on another large language modeling corpus in DAPT.
In the COVID domain, we use the article dataset from Wang et al. (2020). These articles also form the basis for the CovidNER and CovidQA (Möller et al., 2020) datasets, which may explain the large performance improvement from DAPT in this domain. These results suggest that the performance of DAPT is sensitive to the similarity between its language modeling corpus and the target task dataset.

Conclusion
We evaluate the performance of our T+DAPT approach with domain-specific NER, achieving positive results in a zero-shot reading comprehension setting on four different domain-specific QA datasets. These results indicate that T+DAPT robustly improves the zero-shot domain QA performance of pretrained language models across several domains, showing that T+DAPT is a promising approach to domain adaptation for pretrained language models in low-resource settings, particularly when directly training on target-task data is difficult.
In future work, we intend to explore methods to further improve T+DAPT by remedying catastrophic forgetting and maximizing knowledge transfer. For this, we hope to emulate the regularization used by Xu et al. (2020) and to implement multi-task learning and continual learning methods such as AdapterNet (Hazan et al., 2018). To improve the transferability of learned features, we will explore different auxiliary tasks such as NLI and sentiment analysis, in addition to few-shot learning approaches.

Ethical Considerations
Question answering systems are useful tools in complement to human experts, but the "word-of-machine effect" (Longoni and Cian, 2020) demonstrates the effects of a potentially dangerous over-trust in the results of such systems. While the methods proposed in this paper would allow more thorough usage of existing resources, they also bestow confidence and capabilities on models which may not have much domain expertise. T+DAPT models aim to mimic extensively domain-trained models, which are themselves approximations of real experts or source documents. Use of domain adaptation methods in low-data settings could propagate misinformation from a lack of source data. For example, while building an information-retrieval system for biomedical and COVID information could become quicker and less resource-intensive using our approach, people should not rely on such a system for medical advice without extensive counsel from a qualified medical professional.

BioQA Samples
Q: what sugar is found in rna
DAPT: ribose, whereas the sugar in DNA is deoxyribose
T+DAPT: ribose
Q: normal blood pressure range definition
DAPT: 120 mm Hg1
T+DAPT: a blood pressure of 120 mm Hg1 when the heart beats (systolic) and a blood pressure of 80 mm Hg when the heart relaxes (diastolic)

MoviesQA Samples
Q: what is cyborgs real name
DAPT: Victor Stone/Cyborg is a hero from DC comics most famous for being a member of the Teen Titans
T+DAPT: Victor Stone
Q: who plays klaus baudelaire in the show
DAPT: Liam Aiken played the role of Klaus Baudelaire in the 2004 movie A Series of Unfortunate Events.
T+DAPT: Liam Aiken

Table 3: Samples from BioQA and MoviesQA where T+DAPT achieves an exact match with the gold answer and DAPT produces a different answer. Answers from each approach are shown side by side for comparison.

Table 5: Zero-shot F1 performance of RoBERTa-Base models on NewsQA following different amounts of SQuAD fine-tuning. For comparison, the score of our News model from the main paper (2 epochs, all samples) is included as an upper bound, alongside a head-tuning baseline where all weights are frozen except the classifier layer.

A.1 Experiment Details and Additional Experiments
Freezing Layers - We tried freezing the lower layers after NER training and training only the QA layer on SQuAD; the performance is worse than fine-tuning the whole RoBERTa model together with the QA layer. NER and QA may not rely on exactly the same features for the final task, which may explain why freezing causes a performance decrease.
Different Training Epochs and Training Examples - When selecting the best-performing model, we use a validation set in the target domain to evaluate performance. Table 5 shows our trials with different amounts of SQuAD training in the News domain and how they affected performance on NewsQA.
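A minimal sketch of this freezing ablation, assuming HuggingFace's parameter-naming convention for `RobertaForQuestionAnswering` (encoder weights under `roberta.*`, span-prediction head under `qa_outputs.*`); the helper name is ours:

```python
def freeze_all_but_head(named_params, head_prefix="qa_outputs"):
    """Freeze every parameter except the QA head.

    `named_params` is an iterable of (name, param) pairs, as returned by
    torch's `model.named_parameters()`; only `requires_grad` is touched.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable  # names of the parameters left trainable
```

Calling `freeze_all_but_head(model.named_parameters())` before the SQuAD stage would leave the optimizer updating only the span-prediction head, matching the ablation described above.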
Different Training Order - We also tried a different training order: training on the SQuAD 1.1 task first and then on NER yields an F1 score of 42.15 on CovidQA, which shows some improvement, but ordering QA as the last task performs better.
Another Auxiliary Task - In the COVID domain, we also experiment with a more QA-relevant auxiliary task, question classification (QCLS) (Wei et al., 2020). We show the results in Table 4. The experiments show that the QCLS task yields larger improvements than the NER task. In addition, we test a model trained directly on CovidQA as a performance upper bound.