Learning to Generalize for Cross-domain QA

There have been growing concerns regarding the out-of-domain generalization ability of natural language processing (NLP) models, particularly in question-answering (QA) tasks. Current synthesized data augmentation methods for QA are hampered by increased training costs. To address this issue, we propose a novel approach that combines prompting methods with a linear probing then fine-tuning strategy, which entails no additional cost. Our method has been theoretically and empirically shown to be effective in enhancing the generalization ability of both generative and discriminative models. Our approach outperforms state-of-the-art baselines, with an average increase in F1 score of 4.5%-7.9%. Furthermore, our method can be easily integrated into any pre-trained model and offers a promising solution to the under-explored cross-domain QA task. We release our source code at GitHub*.


Introduction
Question answering (QA) models (Oh et al., 2016; Trischler et al., 2017; Lewis et al., 2021; Gu et al., 2021) aim to answer passage-based questions automatically with the help of facts in a given context, a task sometimes referred to as machine reading comprehension (Dua et al., 2019; Sen and Saffari, 2020). Over the last few years, pre-trained models have achieved great progress on a variety of large-scale datasets, e.g., SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), DROP (Dua et al., 2019), CoRA (Asai et al., 2021), and NarrativeQA (Kocisky et al., 2018). However, existing methods can suffer significant performance degradation when the tuned system is directly applied to out-of-domain examples (Gururangan et al., 2018; Wu et al., 2020; Tripuraneni et al., 2020; Kaushik et al., 2020; Malinin et al., 2021; Varshney et al., 2022). This paper focuses on a novel cross-domain QA task in which models trained on a source domain must generalize to a target domain for which no labeled or unlabeled data is available. As shown in Figure 1, QA pairs from different domains have intrinsically different feature distributions. For example, in the technology field, the context frequently contains words such as "e-commerce" and "network", while in the pharmaceutical sector, the context more frequently contains "COVID-19", "vaccine", and "diagnostic". Cross-domain QA poses significant challenges in real-world scenarios, and it has been shown that even large-scale pre-trained models (Gu et al., 2021) can encounter performance degradation under domain generalization.
To address these drawbacks, we introduce a novel cross-domain QA setting, focusing on methods that consistently improve domain generalization performance without additional computational costs. Intuitively, cross-domain QA can benefit from prompting, in which instances from different domains share a unified set of label words; no additional parameters then carry domain-specific information that would hinder OOD generalization to an unseen domain. However, using the prompting method alone could increase the risk of overfitting and bring limited benefits, as prompt templates are fixed and may be learned as spurious features by models. We therefore use a linear probing then fine-tuning (LP-FT) strategy, which reduces the spurious association between prompt patterns and labels by first freezing the pre-trained parameters. In this way, LP-FT benefits cross-domain QA by preventing pre-trained features from being distorted when tuning on a specific domain (Kumar et al., 2022). The prompt-based LP-FT method does not introduce new parameters, so the performance decay when training on a source domain and testing on a new target domain can be reduced without entailing additional cost.
Under the LP-FT framework, we introduce four prompt types to extract invariant features across domains: question type, sentiment, named entity, and key phrase. These prompts aim to increase question similarity and help the model generalize to out-of-domain questions. Existing prompting methods have not been applied to NLP tasks beyond simple fine-tuning settings. To adapt prompting methods to LP-FT, we theoretically prove that LP-FT still holds consistently better robustness for prompting methods (Section 3.3).
We experiment on three different domain datasets (Figure 1). Results show that our prompt-based LP-FT method significantly improves the performance of cross-domain QA models on both the standard hold-out and OOD tests, with an average increase in F1 of 4.5%-7.9% compared to baselines. Our method also consistently outperforms the standard fine-tuning strategy on both discriminative and generative models. In addition, we provide an in-depth ablation analysis of OOD robustness that details the efficacy of LP-FT and prompting methods, respectively. To our knowledge, we are the first to present a new zero-shot cross-domain QA task and to propose a novel prompt-based LP-FT method. All resources are available at https://github.com/FreddieNIU/Prompt-QA.

Related Work
Out-of-domain performance degradation has attracted considerable research interest recently. A line of work (Morgan and Winship, 2015; Wang and Culotta, 2021; Kaushik et al., 2021; Yang et al., 2021; Malkiel and Wolf, 2021; Lu et al., 2022) aims to improve models' generalization ability on text classification. Differently, we investigate the OOD generalization problem on the QA task. Lewis et al. (2021) and Wang et al. (2021) find that 60-70% of test-time answers in popular open-domain QA benchmark datasets also exist in the training set, showing that training-set memorization plays a vital role at test time. Liu et al. (2021a) empirically show that language models suffer performance degradation when there is no train-test overlap. To test the actual generalization ability of QA models, several novel QA datasets have been constructed and released, focusing on evaluating QA models' out-of-domain generalization ability (Gu et al., 2021). Yang et al. (2022) present the first cross-domain QA dataset and observe a performance decay problem in the OOD test. Many existing methods aim to improve the OOD performance of QA models through data augmentation. Yue et al. (2022) introduce a method for synthesizing question-answer pairs to improve target-domain QA performance. In contrast, we propose a prompt-based method combined with linear probing then fine-tuning, which is more computationally efficient and does not require target-domain annotations.
Prompt-based methods on pre-trained language models have received considerable research interest. The paradigm "pre-train, prompt, and predict" replaces the "pre-train, fine-tune" procedure for improved few-shot learning (Liu et al., 2021b). Prompt-based methods have been applied not only to sentence-level few-shot learning tasks, such as named entity recognition (Ma et al., 2021), but also to sophisticated learning tasks like natural language understanding (Wang et al., 2022). However, little work applies prompts to cross-domain QA tasks (Jacovi et al., 2021). We leverage the fixed-format characteristic of prompts to extract invariant features across changing datasets and thereby enhance the OOD generalization of the model.
Instead of fine-tuning, linear probing is an alternative way to tune a pre-trained model on a downstream task. Liu et al. (2019a) and Tripuraneni et al. (2020) examine the representations produced by several recent pre-trained language models and find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific fine-tuned models in many cases but fail on tasks where fine-grained language knowledge is required. Kumar et al. (2022) theoretically prove that the linear probing then fine-tuning (LP-FT) approach can enhance the OOD generalization ability of pre-trained models. In our work, we are the first to provide theoretical evidence that LP-FT still holds consistently better robustness for prompting methods in NLP.

Method
Figure 2 illustrates the workflow of the prompt-based LP-FT method. We first generate a prompt for each input sample through template engineering and prompt designing (§3.1). Then, the source-domain dataset with prompts is used for linear probing and then fine-tuning (§3.2) a pre-trained model (top). Compared with the baseline model, the prompt-based LP-FT model is superior in two respects in the feature space: the distance between the feature distributions of the two domains is reduced, and the features learned by the model are closer to the intersection of the two domains (middle). The feature space illustrates how the prompt-based method and the LP-FT strategy each benefit cross-domain QA, and also shows the motivation for using prompt-based LP-FT for the cross-domain QA task.

Template Designing
We use a template engineering process to look for a prompt template that results in the most effective performance on a given task. The template designing rules can be found in Appendix A.2. The prompt design is inspired by the process of a non-native speaker (or a non-professional reader) reading articles (or professional documents) and answering questions. They may lack some depth of knowledge, such as the meanings of less commonly used words (or domain-specific knowledge). Language models may encounter similar situations in the cross-domain QA task. We design four types of templates. Figure 3 gives an example of a question-type template; other template designs can be found in Appendix A.3. Below, we take the question-type template as an example to illustrate the template designing process.

Question-type Templates. Suppose that for a given question, "Why have we increased our projections for cancer drug Loxo305 and diabetes drug tripeptide is useful?", a human tester tries to find the answer in the article. The tester might not understand tokens such as Loxo305, diabetes, and tripeptide. However, if the tester is aware that the question is about "Why something is useful?", then she/he can search for keywords such as because, as, and since in the article, and the context following these words might help her/him find the correct answer. For each type of question, some specific words might help to locate its answers.
We consider four typical types of questions. For each question type, we first identify the words most related to it, such as because and since for the question type why, by measuring Pointwise Mutual Information (PMI) scores (Bouma, 2009) between candidate words and the question type. Afterward, we select the 50 most related words to generate a prompt for each question.
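The PMI-based selection of prompt words can be sketched as follows. The document-level co-occurrence counting scheme and the tokenized-input format are assumptions, since the paper does not specify them:

```python
import math
from collections import Counter

def pmi_prompt_words(question_type_flags, contexts, top_k=50):
    """Rank candidate context words by PMI with a question type.

    contexts: list of token lists; question_type_flags[i] is True if the
    i-th question belongs to the target type (e.g. a "why" question).
    PMI(w, type) = log p(w, type) / (p(w) p(type)), estimated per document.
    """
    word_counts = Counter()
    joint_counts = Counter()
    n_docs = len(contexts)
    n_type = sum(question_type_flags)
    for flag, ctx in zip(question_type_flags, contexts):
        for w in set(ctx):  # count document-level occurrences
            word_counts[w] += 1
            if flag:
                joint_counts[w] += 1
    p_type = n_type / n_docs
    scored = []
    for w, c_w in word_counts.items():
        p_w = c_w / n_docs
        p_joint = joint_counts[w] / n_docs
        if p_joint > 0:
            scored.append((math.log(p_joint / (p_w * p_type)), w))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_k]]
```

Words that never co-occur with the question type are skipped (their PMI is undefined), and the top-k survivors become the prompt words for that type.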
Loss Functions. For a prompt-based QA task, given a question-context-prompt tuple (Q, C, P), we calculate the probability of each word c_n being the start position (and, analogously, the end position) of the correct answer for discriminative models as

p(c_n) = softmax(W_head h_{c_n} + b_head),

where Q ∈ R^{s_q × d_word}, C ∈ R^{s_c × d_word}, and P ∈ R^{s_p × d_word} denote the question, context, and prompt, respectively; s_q/s_c/s_p and d_word denote the question/context/prompt sentence length and the word embedding dimension, respectively; h_{c_n} denotes the feature representation of (Q, C, P) concatenated along the first dimension, produced by a pre-trained model; and W_head ∈ R^{ν × d_h}, b_head ∈ R^ν, where d_h denotes the dimension of h_{c_n} and ν denotes the length of the answer sentence. The loss function is the sum of the cross-entropy losses for the start and end positions,

L = -(1/m) Σ_{n=1}^{m} [ y_n^start log p_start(c_n) + y_n^end log p_end(c_n) ],

where m is the number of words in C.
We regard the QA task as a Seq2Seq generation task for generative models and use the LM loss

L_LM = -Σ_n log p(c_n | c_{<n}, Q, C, P),

where c_{<n} denotes the previously generated words.
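A minimal numpy sketch of the two losses above, under the simplifying assumption that token features and gold token probabilities are already available, and with a single projection vector per position type standing in for W_head:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def span_loss(h, w_start, w_end, start_idx, end_idx):
    """Cross-entropy over start/end positions for a discriminative QA model.

    h: (seq_len, d_h) token features of the concatenated (Q, C, P) input;
    w_start / w_end: (d_h,) projection vectors of the QA head (one score
    per token, softmax-normalized over the sequence).
    """
    p_start = softmax(h @ w_start)
    p_end = softmax(h @ w_end)
    return -np.log(p_start[start_idx]) - np.log(p_end[end_idx])

def lm_loss(gold_token_probs):
    """Seq2Seq LM loss: sum of -log p(c_n | c_<n, Q, C, P) over gold tokens."""
    return -np.sum(np.log(gold_token_probs))
```

The span loss is minimized when the head assigns high probability to the gold start and end positions; the LM loss is zero only when every gold token gets probability 1.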

Linear Probing then Fine-tuning
The OOD generalization problem is defined as follows (Kumar et al., 2022): given a predictor f : X → Y mapping inputs X to outputs Y and some loss function L, the predictor's in-domain performance L_id and out-of-domain performance L_ood are

L_id(f) = E_{(X,y) ~ P_id} [L(f(X), y)],    L_ood(f) = E_{(X,y) ~ P_ood} [L(f(X), y)],

where the predictor is evaluated on test samples (X, y) drawn from the in-domain distribution P_id and from the out-of-domain distribution P_ood, respectively. To simplify notation, in this paper X represents the question and context (Q and C), and y indicates the answer sentence.
The final predictor f is parameterized as a feature extractor and a linear "head". Hence, the training loss is

L(v, B) = Σ_i L(v^T B x_i, y_i),      (5)

where v denotes the linear head and B denotes the feature extractor. We assume that the initial feature extractor B_0 is obtained from the pre-trained model and consider two methods to learn a predictor f_{v,B}: 1) linear probing, where B = B_0 and the linear head is obtained by minimizing some loss on the training data (Liu et al., 2019a), and 2) fine-tuning, where both v and B are updated on the training data, with B initialized as B_0 (Kumar et al., 2022).
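The two-phase procedure can be sketched on a toy linear model. The squared loss, the gradient-descent details, and the data shapes here are illustrative assumptions, not the paper's exact training setup:

```python
import numpy as np

def lp_ft(X, y, B0, lp_steps=200, ft_steps=200, lr=0.1):
    """Linear probing then fine-tuning on a toy squared loss.

    Predictor f(x) = v @ (B @ x). Phase 1 freezes B = B0 (the "pre-trained"
    feature extractor) and fits only the head v; phase 2 updates both v and
    B, with v initialized from the probed head rather than at random.
    """
    rng = np.random.default_rng(0)
    d_feat, _ = B0.shape
    B = B0.copy()
    v = rng.normal(scale=0.01, size=d_feat)  # random head initialization
    # Phase 1: linear probing (B frozen, only v updated).
    for _ in range(lp_steps):
        feats = X @ B.T
        err = (feats @ v - y) / len(y)
        v -= lr * feats.T @ err
    # Phase 2: fine-tuning (v and B both updated, v starts from the probe).
    for _ in range(ft_steps):
        feats = X @ B.T
        err = (feats @ v - y) / len(y)
        v -= lr * feats.T @ err
        B -= lr * np.outer(v, X.T @ err)  # dL/dB = sum_i err_i * v x_i^T
    return v, B
```

Because phase 2 starts from a head that is already close to optimal, the feature extractor receives only small gradient updates, which is the mechanism by which LP-FT avoids distorting pre-trained features.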

Theoretical Proof
We prove that linear probing then fine-tuning improves the results for prompt tuning by extending the proof for standard fine-tuning (Kumar et al., 2022). In particular, the derivative of the training loss in Eq. 5 with respect to the feature extractor B is

∇_B L(v, B) = Σ_i L'(v^T B x_i, y_i) v x_i^T.      (6)

Every term of Eq. 6 lies in the span of the training inputs, so if U is a sample drawn from a subspace orthogonal to the training subspace, ∇_B L(v, B) U = 0: the training process on X will not decrease the loss on the orthogonal subspace. However, the gradient is non-zero for directions in the ID subspace. This explains why fine-tuning can achieve higher ID performance but lower OOD performance.
In our proposed prompt-based method, the prompt P is concatenated to the original X (along the sentence-length dimension), and the gradient condition becomes

∇_B L_p(v, B)(U + P) ≠ 0,      (7)

where we assume that P is orthogonal to neither X nor its orthogonal subspace. In this way, training on X with prompt P also modifies the loss on OOD samples carrying the prompt.
In the linear probing then fine-tuning method, the OOD error of fine-tuning is bounded as in Eq. 8 of Kumar et al. (2022), where v_ft and B_ft are the linear head and feature extractor after fine-tuning and σ is a fixed parameter. To decrease the OOD error, the initial head v_0 has to be as close to the optimal head v* as possible. It has been proved that initializing the head with v_lp (LP-FT) decreases the OOD error more than randomly initializing the head with v_0 (FT), since v_0 is far from v* (Kumar et al., 2022). Converting the input X to X + P does not affect min(φ, φ²/||w*||²), implying that the LP-FT strategy can still be applied after we introduce P. As a result, the prompt-based LP-FT strategy avoids distorting pre-trained features.

Experimental Setup
We introduce the datasets, baseline methods, and evaluation metrics used in our experiments in this section.

Datasets
We evaluate the proposed method on three datasets: CausalQA (Yang et al., 2022), SQuAD 1.1 (Rajpurkar et al., 2016), and NewsQA (Trischler et al., 2017). All datasets are in English. Domain information is provided in the CausalQA dataset, which is valuable for cross-domain question-answering tasks. For the in-domain test, we experiment on the whole CausalQA dataset before splitting into domains (domain-independent QA) and on each particular domain after splitting into domains. The distribution change and word overlap between datasets can be found in Appendix A.4.
For the OOD test, we have two experimental setups. Setup 1: we split the CausalQA dataset, based on the domain information, into mutually exclusive training/validation/testing sets in the same 8:1:1 ratio. Setup 2: we conduct OOD tests across different datasets from different domains. In cross-domain QA, both the training and validation sets of the source domain are used in the training process for hyperparameter search. The testing sets of the source and target domains are used for in-domain evaluation and OOD tests, respectively.
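Setup 1's per-domain 8:1:1 split can be sketched as below; the `domain` key name and the dict-based sample format are assumptions about the data layout:

```python
import random
from collections import defaultdict

def split_by_domain(samples, seed=42):
    """Build mutually exclusive 8:1:1 train/val/test splits inside each domain.

    samples: list of dicts, each carrying a "domain" key (as provided by
    CausalQA's domain annotations; the exact key name is an assumption).
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    splits = {}
    for domain, items in by_domain.items():
        rng.shuffle(items)
        n = len(items)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        splits[domain] = {
            "train": items[:n_train],
            "val": items[n_train:n_train + n_val],
            "test": items[n_train + n_val:],
        }
    return splits
```

Training on one domain's `train`/`val` split and evaluating on another domain's `test` split then yields the OOD cells of Table 1.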

Baseline Models
Based on the novel cross-domain QA setting, we establish baselines using generative models, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), and discriminative models, BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b), and SpanBERT (Joshi et al., 2020), with the help of the Huggingface framework (Wolf et al., 2020). We also implement the domain adaptation method commonly used in previous works (Yue et al., 2022; Cao et al., 2020) to compare with our method. We use the AdamW optimizer with a default learning rate of 10^-5; other hyperparameters are tuned by optimizing performance on the validation set. We compare four strategies, with standard fine-tuning as the baseline:

1. Baseline: we select the RoBERTa-base model as the baseline of discriminative methods.
2. Baseline + P: we adopt the same baseline models and fine-tuning strategy, only replacing the original dataset with the prompted dataset.
3. Baseline + LP-FT: we first tune the parameters of the last linear layer (the "head" for question answering) and replace the head parameters initialized by the Huggingface framework with the head parameters obtained after linear probing. The original dataset is used in this setting.
4. Baseline + P + LP-FT: the LP-FT strategy is adopted on the dataset with the prompt.

Evaluation Metrics
Following previous work (Gu et al., 2021; Yang et al., 2022), we report Exact Match (EM) and Macro F1. If the predicted answer exactly matches the gold answer for a question-answer pair, EM = 1; otherwise, EM = 0. The Macro F1 score is defined as the mean of token-level F1 scores,

Macro F1 = (1/N) Σ_i F1_i,

where i is the token index and N is the length of the golden answer.
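A minimal implementation of the two metrics, assuming whitespace tokenization and lower-casing as the only normalization (the paper does not specify its normalization scheme):

```python
from collections import Counter

def exact_match(pred, gold):
    """EM = 1 if the normalized prediction equals the gold answer, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string.

    Precision/recall are computed over the multiset intersection of tokens,
    as in standard extractive-QA evaluation.
    """
    pred_toks = pred.strip().lower().split()
    gold_toks = gold.strip().lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

The reported Macro F1 is then the mean of `token_f1` over all question-answer pairs.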

Results and Discussion
Our method is applied to both domain-independent QA tasks ( §5.1) and cross-domain QA tasks ( §5.2).

In-domain Performance
For domain-independent QA, the in-domain performance represents model performance under the traditional hold-out test, where both the training and test sets come from the whole dataset without splitting domains. The domain-independent results are shown in Table 2, where the Prompt LP-FT method brings performance gains over both the BART model (on average +3.64% in F1, +4.95% in EM) and the RoBERTa model (on average +0.42% in F1, +0.14% in EM). Taking the BART model as an example, BART+LP-FT achieves slightly better performance (+0.57%) than BART, which shows that the LP-FT method brings limited benefits on the domain-independent QA task. However, the gain of BART+P over BART (+2.07%) exceeds that of BART+LP-FT over BART (+0.57%), showing that the prompt-based method can benefit the model even without splitting domains.
In Table 1, the numbers on the diagonal represent the ID performance on each domain, and the values in parentheses below them represent the in-domain performance increase brought by our method (on average +2.21% in F1 and +2.82% in EM) (left). Though the performance gain varies across domains, our method consistently improves the performance of in-domain evaluations.

Out-of-domain performance
Results on CausalQA. The experimental results of cross-domain CausalQA are shown in the 3 × 3 grids of Table 1, where each row represents contrast experiments with the same testing data, and each column represents model performance on different test sets. The numbers off the diagonal represent performance tested on a domain different from the training domain, i.e., the OOD test results. Overall, the proposed method improves OOD performance by an average of +3.11% in F1 and +4.76% in EM on BART, and by an average of +2.39% in F1 and +4.51% in EM on RoBERTa. For example, comparing the same scenario, the improvement on Consumer-Train/Industrial-Test with BART (+3.95%) is larger than with RoBERTa (+2.74%). Thus, our method brings larger performance gains for generative models (BART) than for discriminative models (RoBERTa). Intuitively, this can be because the added prompt can be used directly to generate answers, as we fine-tune BART in a Seq2Seq manner.
In Table 1, we compare the performance of BART on Consumer-Train/Consumer-Test with Consumer-Train/Industrial-Test. Our method improves performance by +0.36% on the consumer test set and by +3.95% when testing on samples from the industrial domain, indicating that the proposed method is better suited to cross-domain generalization. Moreover, the benefit on Consumer-Train/Technology-Test (+2.78%) is relatively small compared to the improvement on Consumer-Train/Industrial-Test (+3.95%), hinting that the same prompt has varying effects on different domains. This can be because different domains have intrinsically different feature distributions.
OOD Tests Between Different Datasets. We show the OOD generalization results between different popular datasets in Table 3. The Prompt LP-FT method improves the OOD test performance on the three datasets by an average of 2.54% for RoBERTa and 2.60% for BART. It is worth noting that our method brings a performance improvement of up to 7.42% (NewsQA-CausalQA) on RoBERTa, while the maximum improvement reaches 8.89% (SQuAD-CausalQA) on BART. This result is consistent with the finding in Setup 1 that Prompt LP-FT benefits generative models more than discriminative ones.
It is noteworthy that although our method assumes no target-domain annotations are available (zero-shot), while the baseline Domain Adaptation (DA) method assumes a small number of target annotations are available (few-shot), our method consistently achieves better performance than the DA method in all six settings. These OOD generalization results across the three datasets suggest that Prompt LP-FT is a robust, easy-to-transfer method for improving the cross-domain generalization ability of QA models.

Discussion
We provide a discussion to better understand the relative contributions of Prompt LP-FT toward performance improvement, covering the universality of our method, an ablation study, and a case study.

Universality. The results in Sections 5.1 and 5.2 show that our proposed method improves the OOD generalization performance of various models to varying degrees, with ID performance increasing as well. Experimental results on multiple models demonstrate that our method has good portability and can benefit different models, including generative (BART) and discriminative (RoBERTa) ones. Results on more backbone models (e.g., T5 and SpanBERT) can be found in Appendix A.5.

Ablation Study. Figure 4 shows an ablation study of prompt-based LP-FT. We find that the combination of prompting methods with LP-FT achieves the best performance in four of six settings, illustrating the advantage of prompt-based LP-FT. In addition, BART+Prompt shows an absolute advantage over BART+LP-FT, which can be because prompts benefit the cross-domain QA task by introducing more background knowledge than an adjustment of the tuning strategy does. The detailed ablation results are shown in Appendix A.5.

Case Study. Table 4 presents a case study of four test samples. For each instance, we show the input context, the prompt sentence, and the outputs predicted by the baseline method and by our method (Prompt LP-FT). [Table 4 contents omitted here: context/prompt/output examples drawn from passages on Avnet's supplier services, a retailer's store-base growth, and a restaurant brand's economic value creation.]

Table 4: Case study of "Why" and "What-if" question-answering tasks based on the SpanBERT-base and BART-base models. The Gold Answer is highlighted in green text, while the Incorrect Answer predicted by the baseline method is highlighted in red text.

It can be seen that the gold answers are mostly included in the output of Prompt LP-FT, while the output of the baseline models is prone to errors. Specifically, baseline models, including SpanBERT-base and BART-base, tend to output answers close to the question in the context instead of observing the whole sentence. For example, for the question "What will be prevented if suppliers ... customers?", the SpanBERT-base model outputs the wrong answer "using Avnet", which is close to the question in the context, while the correct answer, "investment in sales and engineering", is ignored. These comparisons provide evidence that our method is beneficial in addressing the spurious features of sentence order for QA models. This can be because the well-designed prompt combined with LP-FT helps QA models understand the context better.

Conclusion
We introduce a zero-shot cross-domain QA task and present a novel prompt-based LP-FT method that combines prompting with a linear-probing then fine-tuning strategy. We provide theoretical proof that the proposed method enhances a model's in-domain performance and out-of-domain generalizability, and empirically show that the Prompt LP-FT method consistently benefits QA models.
Experimental results show that (1) current methods still lag far behind human-level performance on cross-domain QA generalization; (2) our method brings larger performance gains for generative models than for discriminative models; and (3) applying prompt-based LP-FT to other NLP tasks is worth exploring. Meanwhile, the emergent abilities of LLMs will likely reduce the difficulty of the current cross-domain QA setting; designing challenging cross-domain QA datasets for LLMs deserves more attention in the future.

Limitation
Our method has a few limitations which may inspire future work.First, the prompt templates are manually designed, although we've introduced the rules and intuitions used in our implementation.
Second, the proposed method may scale poorly to long texts. Because we add the prompt at the end of the context, the prompt will be truncated if the context itself exceeds the maximum acceptable token length of the model.
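One way to see the issue: if inputs are truncated from the right, the appended prompt is the first thing lost. A sketch of a possible mitigation that truncates the context rather than the prompt; whitespace tokenization stands in for the model's subword tokenizer, and this is not the paper's implementation:

```python
def build_input(question, context, prompt, max_len=512, tokenize=str.split):
    """Append the prompt after the context, truncating the context (not the
    prompt) when the combined input exceeds the model's max length.

    A simplified sketch: a real implementation would use the model's subword
    tokenizer and account for special tokens ([CLS], [SEP], etc.).
    """
    q, c, p = tokenize(question), tokenize(context), tokenize(prompt)
    budget = max_len - len(q) - len(p)  # tokens left for the context
    return q + c[:max(budget, 0)] + p
```

This keeps the prompt intact at the cost of dropping the tail of the context, which may itself discard the answer span; the trade-off is why we flag long inputs as a limitation.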

A.1 Template Comparison
As we see in Table 5, changing "But" in the template to "And" alters the logical relationship between the preceding and following sentences, which had an impact of more than 1% on the performance.

A.2 Template Engineering
The main objective of applying prompt templates is to enhance the model's out-of-domain performance by extracting invariant features between questions from different domains. Therefore, the first rule is that a designed template should avoid containing domain-related information. For example, "This [health] company [Hologic] is mentioned in the question." is not an ideal template because it involves extra domain information, namely that Hologic is a health company. Second, a template should be a proper English sentence with correct spelling, no grammar mistakes, and proper semantic meaning. Our experiments show that one wrong word in a template may cause significant performance variation (see Appendix A.1).
Third, since templates are concatenated to the end of the context, templates cannot be too long. If a template has almost the same length as the context, or is even longer, it will double the amount of input data and thus increase the computational cost of the model; more importantly, it may diminish the leading role of the context, which may make the model too generalized to capture the answers.
Fourth, there are two main varieties of prompt templates: cloze prompts and prefix prompts (Liu et al., 2021b). A cloze prompt fills in the blanks of a textual string, and a prefix prompt continues a string prefix. Instead of using only one type, we include both variants among the four designed prompt templates.
According to these rules, we design four types of templates, each of which has different sentence patterns. Template generation is modularized as a two-step process: 1) generating the prompt words, and 2) filling in the blanks (Liu et al., 2021b).

Experiment 1: "There is no important phrase in this query. But also pay attention to these words: __" — F1: 69.57
Experiment 2: "There is no important phrase in this query. And also pay attention to these words: __" — F1: 70.84

Table 5: The effect of using an improper word in a template.

A.3 Template Designing
Sentiment Templates. Assume that a person unfamiliar with the restaurant industry tries to answer the question, "Why the global restaurant sector has come under pressure?". This person can easily tell that the question concerns factors that adversely affect restaurants, even without industry knowledge. Therefore, looking for negative words in the context, such as destroyed and restricted, may help to locate the correct answer. Based on this intuition, we implement a sentiment analysis framework* to give each question and each word in the answer sentence a sentiment score. Afterwards, the highest positive or negative scores are selected to be used as the prompt words. In the second step, the sentiment of the question and the prompt words are filled in the blanks of the sentiment templates.

Named Entity Templates. Unique entities mentioned in a question can hint at how to answer it. Hence, a named entity recognition framework is applied to each question. We intend to recognize five types of entities mentioned in the question: Person, Organization, Location, Country, and Date. Entities not included in the five types are assigned to an "Other" category. Step two fills the recognized entities into the blanks as prompt words.
Phrase Templates. Phrases are usually the subject of the question and are potentially valuable in locating the correct answer. A simple strategy is designed to identify phrases composed of adjective(s) and noun(s). For example, "hybrid environments", "software-as-a-service applications", and "remote access" are phrases in a question. These phrases are selected as prompt words and filled in the blanks in step two.
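The adjective(s)+noun(s) matching can be sketched as below; the POS predicates are supplied by the caller, standing in for whatever tagger the pipeline actually uses (the paper does not name one):

```python
def extract_phrases(tokens, is_adjective, is_noun):
    """Collect adjective(s)-followed-by-noun(s) spans as candidate prompt phrases.

    tokens: list of word tokens from the question;
    is_adjective / is_noun: POS predicates (token -> bool), assumed to come
    from an external POS tagger in a real implementation.
    """
    phrases, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and is_adjective(tokens[j]):
            j += 1  # consume a run of adjectives
        k = j
        while k < len(tokens) and is_noun(tokens[k]):
            k += 1  # consume the following run of nouns
        if j > i and k > j:  # at least one adjective followed by one noun
            phrases.append(" ".join(tokens[i:k]))
            i = k
        else:
            i += 1
    return phrases
```

The extracted spans are then used verbatim as prompt words in step two of the template-filling process.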

A.5 Experiment Results on Other Models
On both domain-independent QA and cross-domain QA tasks, the SpanBERT model achieves state-of-the-art performance. Table 6 shows the domain-independent QA results of the SpanBERT-base and SpanBERT-large models, providing evidence that the proposed method also works on the large model, which achieves better results than the base model. Table 7 shows the results of the SpanBERT OOD test.
Table 8 shows the cross-domain QA experiment results on T5-base. Our method significantly improves cross-domain QA performance over standard fine-tuning on the CausalQA dataset.
Tables 9 and 10 and Figure 7 present the ablation study results on the RoBERTa and BART models for the cross-domain QA task on the CausalQA dataset.

A.6 Details of experimental results
The experiments are conducted on an RTX 3090 Ti with 24GB of graphics RAM. The average training time for each model is around 2.5 hours on the domain-independent QA task and around 30 minutes on the cross-domain QA task with the CausalQA dataset. On the SQuAD and NewsQA datasets, the average training time for each model is around 3 hours. For each experiment setting, we run 10 repeated experiments and report the average results. A model name without a size specification refers to the base model, e.g., "BART+P" indicates the BART-base model plus the prompting method. We also implemented large models to demonstrate the effectiveness of the proposed approach.
For hyperparameter tuning on the domain-independent QA task, we split the whole dataset into train/validation/test sets and use the validation set for hyperparameter tuning. On the cross-domain QA task, we split the dataset of each domain into train/validation/test sets and tune hyperparameters on the validation set from the same domain as the training set. The selection criterion is the F1 score on the validation set. We first select a series of candidate values for each hyperparameter through uniform sampling from a reasonable range, then choose the value that achieves the best F1 on the validation set. Each hyperparameter value is decided by three repeated trials. For example, the best-performing RoBERTa-base configuration on the Consumer-Train/Technology-Test experiment is as follows: the learning rate for linear probing is 10^-6, the number of epochs for linear probing is 3, the learning rate for fine-tuning is 10^-5, the training batch size is 4, parameters are updated every 8 batches, and the number of epochs for fine-tuning is 14.
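For concreteness, the two-stage LP-FT schedule with the RoBERTa-base configuration reported above can be written as the following data structure. The dictionary layout is our own illustration; only the numbers come from the reported configuration.

```python
def lpft_schedule():
    """Two-stage schedule: linear probing (head only), then full fine-tuning.

    Stage 1 keeps the backbone frozen and trains only the QA head;
    stage 2 unfreezes all parameters. Gradient accumulation over 8 batches
    of size 4 gives an effective batch size of 32 in stage 2.
    """
    return [
        {"stage": "linear_probe",  # backbone frozen, QA head trainable
         "trainable": "qa_head", "lr": 1e-6, "epochs": 3},
        {"stage": "fine_tune",     # all parameters trainable
         "trainable": "all", "lr": 1e-5, "epochs": 14,
         "batch_size": 4, "grad_accum_steps": 8},
    ]
```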

Figure 1: Cross-domain QA task consists of three datasets from different domains.

Figure 2: The workflow of the prompt-based linear-probing then fine-tuning strategy. The bottom part shows the template engineering process, where we add a prompt to each sample. The middle part represents the feature space of the prompt-based LP-FT model and the baseline models. Compared with the baseline model, the prompt-based LP-FT model is superior in two respects in the feature space: the distance between the feature distributions of the two domains is reduced, and the features learned by the model are closer to the intersection of the two domains. The top part demonstrates the linear probing then fine-tuning process. "Q", "C", and "P" represent "Question", "Context", and "Prompt" in a sample, respectively (bottom).

Figure 4: Ablation results based on the BART-base model.
Xiang Yue, Ziyu Yao, and Huan Sun. 2022. Synthetic question value estimation for domain adaptation of question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1340-1351.

Figure 5: The word overlap between the different datasets considered in the CausalQA task.

Figure 7: Ablation study results based on the RoBERTabase model for cross-domain QA.

Table 1: Out-of-domain test results of the BART-base model (left) and RoBERTa-base model (right) on CausalQA (Setup 1). The numbers in brackets represent the performance improvement from our method. "Ori" denotes the original fine-tuning method, and "Ours" denotes the prompt-based LP-FT method.
The Macro F1-score (F1) and exact match (EM) are used to evaluate the model's performance. If the predicted answer exactly matches the true answer, the EM score is 1; otherwise, it is 0.
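The two metrics can be sketched as below. This is a minimal token-level illustration, not the paper's exact evaluation script, which may apply additional answer normalization.

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the (lowercased, stripped) strings are identical, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Per-question scores computed this way are then averaged over the test set.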

Table 2: Results of domain-independent QA on the CausalQA dataset. "F1" refers to Macro F1, and "EM" refers to exact match. The model name refers to the base model, "+P" denotes the base model with the prompting method, and "+LP-FT" denotes the base model with the LP-FT method.

Table 3: OOD test results on SQuAD (S), CausalQA (C), and NewsQA (N) (Setup 2). [S->N] indicates that the model is trained on SQuAD and tested on NewsQA. "+P" represents the prompting methods. "+DA" represents the Domain Adaptation method (Yue et al., 2022). The proposed method shows statistically significant improvements over the baseline model, indicated by * (Student's t-test, p<0.01, 10 runs).
Context: As Terex has expanded its MP product line, it has captured a larger global market share of the industry, allowing it to gain greater insight into customer demand. This may provide revenue synergies in the future... Prompt: To answer a "Why useful" question, you need to look for "allowing".

Table 6: Domain-independent QA results of the SpanBERT-base and SpanBERT-large models.
* implemented using the NLTK module

Table 7: Out-of-domain test results of SpanBERT-base.

A.4 Word Overlap between Datasets
Figures 5 and 6 show the word overlap percentage between different domains of the CausalQA dataset, as well as between datasets from different domains, i.e., the SQuAD, CausalQA, and NewsQA datasets.

Table 8: Out-of-domain test results of the T5-base model. The numbers in brackets represent the performance improvement from our method.