Robust Question Answering against Distribution Shifts with Test-Time Adaptation: An Empirical Study

A deployed question answering (QA) model can easily fail when the test data has a distribution shift compared to the training data. Robustness tuning (RT) methods have been widely studied to enhance model robustness against distribution shifts before model deployment. However, can we improve a model after deployment? To answer this question, we evaluate test-time adaptation (TTA) to improve a model after deployment. We first introduce COLDQA, a unified evaluation benchmark for robust QA against text corruption and changes in language and domain. We then evaluate previous TTA methods on COLDQA and compare them to RT methods. We also propose a novel TTA method called online imitation learning (OIL). Through extensive experiments, we find that TTA is comparable to RT methods, and applying TTA after RT can significantly boost the performance on COLDQA. Our proposed OIL improves TTA to be more robust to variation in hyper-parameters and test distributions over time.


Introduction
How to build a trustworthy NLP system that is robust to distribution shifts is important, since the real world changes dynamically and a system can easily fail when the test data has a distribution shift compared to the training data (Ribeiro et al., 2020; Wang et al., 2022). Much previous work on robustness evaluation has found model failures on shifted test data. For example, question answering (QA) models are brittle when dealing with paraphrased questions (Gan and Ng, 2019), models for task-oriented dialogues fail to understand corrupted input (Liu et al., 2021a; Peng et al., 2021), and neural machine translation degrades on noisy text input (Belinkov and Bisk, 2018). In this work, we study the robustness of QA models to out-of-distribution (OOD) test-time data. (Footnote 1: Our source code is available at https://github.com/oceanypt/coldqa-tta)
To build a model that is robust against distribution shifts, most previous work focuses on robustness tuning (RT) methods that improve model generalization pre-deployment, such as adversarial training (Madry et al., 2018). However, can we continually enhance a model post-deployment? To answer this question, we study and evaluate test-time adaptation (TTA) for robust QA after model deployment. TTA generalizes a model by continually updating it with test-time data (Sun et al., 2020). As shown in Fig. 1, in this work, we focus on test-time adaptation in real time, where the model predicts and updates over a data stream on the fly. For each test data instance, the model first returns its prediction and then updates itself with the test data. Unlike unsupervised domain adaptation (Ramponi and Plank, 2020) studied in NLP, TTA is suitable for domain generalization, since it makes no assumption about the target distribution and can adapt the model to any arbitrary distribution at test time.
We discuss TTA methods in §3, where we first present popular previous TTA baselines and then introduce our newly proposed TTA method, online imitation learning (OIL). OIL is inspired by imitation learning: the adapted model learns to clone the actions made by the source model, and the source model helps reduce overfitting to noisy pseudo-labels in the adapted model. We further adopt causal inference to control model bias from the source model. Next, to compare with TTA methods, we briefly discuss previous robustness tuning (RT) methods such as adversarial training in §4.
To study and analyze TTA for robust QA post-deployment, we introduce COLDQA in §5, a unified evaluation benchmark for robust QA against distribution shifts from text corruption, language change, and domain change. It differs from previous benchmarks that study only one type of distribution shift (Ravichander et al., 2021; Hu et al., 2020; Fisch et al., 2019). COLDQA expects a QA model to generalize well to all three types of distribution shifts.
Our contributions in this work include the COLDQA benchmark, the OIL method, and an extensive empirical study of TTA for robust QA. Based on the experimental results in §6, we report the following findings:
• COLDQA is challenging and not all RT methods are effective on COLDQA (§6.2);
• Overall, as Fig. 2 shows, TTA is comparable to RT, and applying TTA after RT can further boost model performance (§6.2);
• Compared to previous TTA baselines, OIL is more robust to changes in hyper-parameters and test distributions over time (§6.3).

Related Work
Robust QA Much previous work on model robustness evaluation has shown that NLP models fail on test data with distribution shifts compared to the training data (Rychalska et al., 2019; Ribeiro et al., 2020; Wang et al., 2022). For QA tasks, Ravichander et al. (2021) study how text corruption affects QA performance. Lewis et al. (2020) and Artetxe et al. (2020) analyze cross-lingual transfer of a QA system. Fisch et al. (2019) benchmark the generalization of QA models to data with domain shift. In this work, we jointly study distribution shifts due to corruption, language change, and domain change. Adversarial samples cause another type of distribution shift (Jia and Liang, 2017) which is not studied in this work. Hard samples (Ye et al., 2022), dataset bias (Tu et al., 2020), and other robustness issues are also not the focus of this work.
Test-Time Adaptation TTA adapts a source model with test-time data from a target distribution. TTA has been shown to be very effective in image recognition (Sun et al., 2020; Wang et al., 2021b; Liu et al., 2021b; Bartler et al., 2022). In NLP, Wang et al. (2021d) apply TTA, while most related work studies unsupervised domain adaptation (UDA) (Li et al., 2020; Ye et al., 2020; Karouzos et al., 2021). UDA tries to minimize the gap between the source and target domain. Recent work studies UDA without knowing the source domain (Liang et al., 2020; Su et al., 2022), which means the model can be adapted to any unseen target domain on the fly. However, these methods assume all target data is available when performing adaptation, unlike the online setting studied in this work.

Robustness Tuning Much previous work improves model robustness at training time with adversarial training (Miyato et al., 2017; Madry et al., 2018; Zhu et al., 2020; Wang et al., 2021a). Some work also uses regularization to improve model generalization (Wang et al., 2021c; Zheng et al., 2021; Cheng et al., 2021; Jiang et al., 2020). Prompt tuning (Lester et al., 2021) is another training-time approach.

Test-Time Adaptation
Problem Definition Given a source model π_0 trained on a source distribution S, test-time adaptation (TTA) adapts the model to the test distribution T with the test data, which enhances the model post-deployment. In the setting of online adaptation, test-time data comes in a stream (see footnote 2). As shown in Fig. 1, at time t, for the test data x_t ∼ T, the model π_t first predicts its labels y_t to return to the end user. Next, π_t adapts itself with a TTA method, and the adapted model is carried forward to time t+1. The process can proceed without stopping as more test data arrive. There is no access to the gold labels of the test data in the whole process. We compare the setting studied in this work, online test-time adaptation, with unsupervised domain adaptation and robustness tuning in Table 1.
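The predict-then-adapt protocol above can be sketched as follows; `AdaptedModel` and its methods are hypothetical stand-ins for a QA model and a TTA update rule, not the paper's implementation.

```python
# Minimal sketch of the online test-time adaptation protocol:
# at each step, first predict for the user, then adapt on the same batch.
# `AdaptedModel` is a toy stand-in, not the paper's actual QA model.

class AdaptedModel:
    def __init__(self):
        self.num_updates = 0  # proxy for the evolving parameters pi_t

    def predict(self, x):
        # return predictions *before* adapting on x
        return [f"answer({item})" for item in x]

    def adapt(self, x):
        # one unsupervised update on the test batch (Tent/PL/OIL go here)
        self.num_updates += 1

def online_tta(model, test_stream):
    """Predict-then-adapt over a stream; gold labels are never used."""
    outputs = []
    for x_t in test_stream:
        outputs.append(model.predict(x_t))  # returned to the end user
        model.adapt(x_t)                    # model carried to time t+1
    return outputs

stream = [["q1", "q2"], ["q3"]]
model = AdaptedModel()
preds = online_tta(model, stream)
```

Note that the prediction for `x_t` is produced before the model sees `x_t` for adaptation, so the end user never waits on the update step's outcome.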
Footnote 2: In this work, we do not study offline adaptation.

TTA with Tent and PL
We first discuss two prior TTA methods, Tent (Wang et al., 2021b) and PL (Lee, 2013). Tent adapts the model by entropy minimization: the model predicts the outputs over test-time data and uses the entropy of the predictions as the loss for optimization. Similarly, PL is a pseudo-labeling method, predicting pseudo-labels on test-time data and calculating the cross-entropy loss. Tent is simple, yet it achieves SOTA performance on computer vision (CV) tasks such as image classification, compared to other more complex methods such as TTT (Sun et al., 2020), which needs to modify the training process by introducing extra self-supervised losses. Other TTA methods improve over Tent (Bartler et al., 2022; Liu et al., 2021b), but they are much more complex. Formally, Tent and PL start from the source model π_0. At time t, the model π_t updates itself with the test data x_t. The loss for optimization is:

    l_t(π_t) = H(p_t)  (Tent),    l_t(π_t) = H(p_t, y_t)  (PL)     (1)

where p_t is the predicted probabilities over the output classes of x_t from the model π_t, and y_t = argmax_i p_t[i]. H(·) and H(·,·) are the entropy and cross-entropy loss respectively. On the data x_t, the model is optimized with only one gradient step to get π′_t: π′_t ← π_t − η∇l_t(π_t). Then the model π′_t is carried forward to time t+1: π_{t+1} ← π′_t.
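The two losses can be written out directly; below is a minimal pure-Python sketch over a single probability vector (the probability values are illustrative).

```python
import math

def entropy(p):
    """Tent loss: H(p) = -sum_i p_i * log p_i over predicted probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def pseudo_label_loss(p):
    """PL loss: cross-entropy H(p, y_t) with y_t = argmax_i p[i],
    which reduces to -log p[y_t]."""
    y_t = max(range(len(p)), key=lambda i: p[i])
    return -math.log(p[y_t])

p = [0.7, 0.2, 0.1]
tent_loss = entropy(p)            # penalizes uncertain (flat) predictions
pl_loss = pseudo_label_loss(p)    # = -log 0.7
```

Both losses need no labels, which is what makes them applicable at test time; a one-hot prediction drives both losses to zero.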

Online Imitation Learning
Adapting by the model alone, Tent and PL may easily lose the ability to predict correct labels, since the labels they predict are not verified to be correct, and learning from such noisy signals may degrade the model. The model may not recover once it starts to deteriorate. To overcome this issue, inspired by imitation learning (Ross et al., 2011), we propose online imitation learning (OIL). OIL trains a learner (or model) π under the supervision of an expert π_e in a data stream. The expert can help the model to be more robust throughout adaptation, since the expert is stable and the learner clones the behavior of the expert. Formally, at each time t, the expert π_e takes an action (makes a prediction) ŷ_t ∼ π_e on x_t ∼ T. The learner π_t then learns to clone this action by optimizing a surrogate objective l_t:

    l_t(π_t) = L(y_t, ŷ_t)     (2)

where y_t is the action taken by the learner at time t and L(·,·) measures the distance between the two actions. At time T, with a sequence of online loss functions {l_t}_{t=1}^T and the learners Π = {π_t}_{t=1}^T, the regret R(T) is defined as:

    R(T) = Σ_{t=1}^T l_t(π_t) − min_{π∈Π} Σ_{t=1}^T l_t(π)     (3)

We try to minimize this regret during adaptation, which is equivalent to optimizing the loss function l_t(π_t) at each time t (Ross et al., 2011).
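The regret definition can be illustrated with toy numbers; the loss values below are made up, and the small candidate set stands in for the policy class Π.

```python
# Toy illustration of the online regret R(T): cumulative loss of the
# adapting learner minus the best fixed policy in hindsight.
# The loss tables below are made-up numbers, not from the paper.

def regret(online_losses, fixed_policy_losses):
    """R(T) = sum_t l_t(pi_t) - min over fixed policies of sum_t l_t(pi)."""
    best_fixed = min(sum(losses) for losses in fixed_policy_losses)
    return sum(online_losses) - best_fixed

# losses incurred by the learner pi_t at each time step
online = [0.9, 0.6, 0.4, 0.3]
# per-step losses of two candidate fixed policies
fixed = [[0.5, 0.5, 0.5, 0.5], [0.8, 0.7, 0.2, 0.1]]
r = regret(online, fixed)  # 2.2 - min(2.0, 1.8) = 0.4
```

A regret that grows sublinearly in T means the online learner's average loss approaches that of the best fixed policy, which is the sense in which minimizing each l_t(π_t) is a sound objective.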

Instantiation of TTA with OIL
At time 0, both the learner π and the expert π_e are initialized from the source model π_0. At time t, the loss function l_t(π_t) for optimization is:

    l_t(π_t) = H(p_t, ŷ_t)     (4)

where p_t is the predicted probabilities over the output classes of x_t from the learner π_t, and ŷ_t = argmax_i p̃_t[i], in which p̃_t is the corresponding predicted probabilities of the expert π_e. As with Tent and PL, the model is optimized with one gradient step to get π′_t: π′_t ← π_t − η∇l_t(π_t), and the model π′_t is carried forward to time t+1: π_{t+1} ← π′_t. We can also update the expert using the model parameters of the learner. At time t, we update the expert as:

    θ_{π_e} ← α·θ_{π_e} + (1 − α)·θ_{π′_t}     (5)

where θ represents the model parameters and α is a hyper-parameter that controls the updating of the expert. α is set to a high value such as 0.99 or 1, so the expert stays close to the source model π_0 during adaptation. In this respect, the expert is similar to the mean teacher (Tarvainen and Valpola, 2017).
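The expert update in Eq. 5 is an exponential moving average of parameters; here is a minimal sketch with plain lists of floats standing in for network weights.

```python
# Sketch of the expert update in Eq. 5: an exponential moving average
# (mean-teacher style) of the learner's parameters. Plain float lists
# stand in for network weights; the values are illustrative.

def update_expert(expert_params, learner_params, alpha):
    """theta_expert <- alpha * theta_expert + (1 - alpha) * theta_learner."""
    return [alpha * e + (1 - alpha) * l
            for e, l in zip(expert_params, learner_params)]

expert = [1.0, 2.0]     # stands in for the source-initialized expert
learner = [0.0, 0.0]    # stands in for the adapted learner after a step
expert = update_expert(expert, learner, alpha=0.99)

# with alpha = 1 the expert stays frozen at the source model
frozen = update_expert([1.0, 2.0], [0.0, 0.0], alpha=1.0)
```

The high α means the expert drifts only slowly toward the learner, which is why it can serve as a stable supervision signal even when individual learner updates are noisy.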
Furthermore, since the expert is initialized from the source model, and because of the distribution shift, the actions taken by the expert may be noisy. We can filter out these noisy actions and avoid learning from them. The loss function in Eq. 4 then becomes:

    l_t(π_t) = 1[H(p_t, ŷ_t) < γ] · H(p_t, ŷ_t)     (6)

where the cross-entropy loss H(p_t, ŷ_t) is used to identify the noisy actions, 1[·] is the indicator function, and γ is a hyper-parameter serving as a threshold.
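The filtering in Eq. 6 can be sketched as follows, assuming the update is skipped whenever the cross-entropy against the expert's action reaches the threshold γ; the probability values are illustrative.

```python
import math

# Sketch of the filtering in Eq. 6: apply the cross-entropy loss only when
# it is below gamma, treating a large disagreement between learner and
# expert as a noisy action to be skipped.

def filtered_loss(p, expert_action, gamma):
    """Return H(p, y_hat) if it is below gamma, else 0 (no update)."""
    ce = -math.log(p[expert_action])
    return ce if ce < gamma else 0.0

confident = filtered_loss([0.9, 0.1], expert_action=0, gamma=0.5)  # kept
noisy = filtered_loss([0.2, 0.8], expert_action=0, gamma=0.5)      # CE ~ 1.61, filtered out
```

Setting γ = ∞ recovers the unfiltered loss of Eq. 4, which matches the appendix observation that no filtering is needed on most test sets.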

Enhancing OIL with Causal Inference
Since the expert is initialized by the source model, when it predicts labels on the test data, its behavior will be affected by the knowledge it has learned from the source distribution, which is what we call model bias in this work. Since the test distribution differs from the source distribution, and the expert provides instructions for the learner to clone, such model bias has a negative effect on the learner. Here, we further use causal inference (Pearl, 2009) to reduce the effect of this model bias.

Causal Graph
We assume that the model output of the learner π is affected by direct and indirect effects from the input, as shown in the causal graph in Fig. 3a. The causal graph includes the input X, the output Y, and the potential model bias M from the expert. X → Y is the direct effect. X → M → Y represents the indirect effect, where M is a mediator between X and Y.
M is determined by the input X, which can come from in-distribution or out-of-distribution data.
Causal Effects Our goal in causal inference is to keep the direct effect but control or remove the indirect effect. As shown in Fig. 3b, we calculate the total direct effect (TDE) along X → Y as:

    TDE = [Y_y | do(X = x)] − [Y_y | do(X = x_0), M = m_x]     (7)

where the do operation is the causal intervention (Glymour et al., 2016), which removes the confounders to X. However, since there is no confounder to X in our assumption, we simply omit the do operation, so the TDE reduces to [Y_y | X = x] − [Y_y | X = x_0, M = m_x].
Model Training Given the total direct effect in Eq. 7, we first have to learn the left term [Y_y | X = x], which is the combination of the direct and indirect effects along X → Y and X → M → Y respectively. We use the learner π to learn the direct effect. For the indirect effect, the model bias M exhibits different behaviors on data from different distributions. Since the learner π and the expert π_e capture the test and source distributions respectively, we use the discrepancy in their outputs to represent the model bias. Considering the model bias, the loss function l_t in Eq. 6 becomes Eq. 8, where p_t and p̃_t are the predicted probabilities over the output classes of the learner π_t and the expert π_e respectively. p_t captures the direct effect and p_t − p̃_t learns the indirect effect.

Inference When performing inference, we take the action y which has the largest TDE value. Based on Eq. 7 for TDE calculation, we obtain the prediction over the input x_t using the learner π_t as in Eq. 9, where β controls the contribution of the indirect effect. When calculating the TDE score, we assume the model output is zero given the null input x_0, since the model cannot make predictions without the input. We set β to 1 throughout the experiments, which completely eliminates the effect of model bias.

Algorithm 1 Online Imitation Learning
Require: Source model π_0; memory bank size K; α for expert updating; γ for filtering noisy actions; β for controlling indirect effect.
1: Initialize the expert π_e ← π_0
2: for t = 1, 2, ... do
3:   Return predictions on x_t using Eq. 9
4:   Enqueue x_t into the memory bank and dequeue x_{t−K}
5:   for each batch x_k in the memory bank do
6:     Use x_k to update the learner π_t as in Eq. 8
7:     Update the expert π_e as in Eq. 5
8:   end for
9: end for

Implementation of TTA for the QA Task
For extractive question answering, the model needs to predict the start and end positions of the answer span. The above TTA methods treat the two positions independently, apply the same loss, i.e., l_t(π_t), to each of them separately, and take the average of the two losses as the final loss. We present the pseudocode of OIL in Algorithm 1; Tent and PL follow the same procedure but with different losses. The data x_t at each time t is a batch of instances. We maintain a memory bank of size K to store the data from time t−K to t, which more fully exploits test-time data for model adaptation. At each time t, we enqueue x_t into the memory bank and dequeue x_{t−K}. Each batch of data from the memory bank is then used to optimize the online loss in Eq. 8. The expert for OIL is updated accordingly.
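A minimal sketch of these QA-specific details: averaging the per-position losses (shown here with the PL loss for concreteness) and a bounded memory bank holding the most recent K batches. The probability values are illustrative.

```python
import math
from collections import deque

# Sketch of the QA-specific implementation: the start- and end-position
# losses are computed independently and averaged, and a bounded memory
# bank replays the most recent K test batches.

def qa_tta_loss(start_probs, end_probs):
    """Average the per-position pseudo-label losses -log p[argmax p]."""
    def pl_loss(p):
        return -math.log(max(p))
    return 0.5 * (pl_loss(start_probs) + pl_loss(end_probs))

K = 3
memory_bank = deque(maxlen=K)  # enqueue x_t; the oldest batch drops out
for t in range(5):
    memory_bank.append(f"batch_{t}")

loss = qa_tta_loss([0.5, 0.5], [0.25, 0.75])
```

`deque(maxlen=K)` gives the enqueue/dequeue behavior of the memory bank for free: appending the batch at time t silently evicts the batch from time t−K.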

Robustness Tuning
In contrast to improving the model post-deployment with TTA, robustness tuning (RT) enhances the model pre-deployment. RT has been studied in NLP to improve model generalization (Wang et al., 2022). RT methods are applied at training time when training the source model. We also benchmark RT methods on COLDQA to compare them with TTA methods.
First, we compare with adversarial training methods: FGM (Miyato et al., 2017), PGD (Madry et al., 2018), FreeLB (Zhu et al., 2020), and InfoBERT (Wang et al., 2021a). Next, we further evaluate robustness tuning methods proposed for cross-lingual transfer: MVR (Wang et al., 2021c) and xTune (Zheng et al., 2021). These two methods use regularization to enhance model robustness. None of these methods has been comprehensively evaluated on distribution shifts arising from text corruption, language change, and domain change.

Combination of RT and TTA Finally, we also study combining RT and TTA methods: the source model is first tuned by an RT method, and then this model is adapted by a TTA method to the test distribution.
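As a concrete illustration of one RT baseline, here is a minimal sketch of an FGM-style perturbation, r = ε·g/||g||_2, applied to a toy embedding vector; the real baselines apply such perturbations to model embedding layers during training, not to plain lists.

```python
import math

# Sketch of an FGM-style adversarial perturbation (Miyato et al., 2017):
# move the embedding a distance eps along the L2-normalized gradient of
# the loss, then train on the perturbed input. Toy vectors only.

def fgm_perturb(embedding, grad, eps):
    """Return embedding + eps * grad / ||grad||_2 (unchanged if grad is 0)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)
    return [e + eps * g / norm for e, g in zip(embedding, grad)]

emb = [1.0, 2.0]
adv = fgm_perturb(emb, grad=[3.0, 4.0], eps=0.5)  # ||g|| = 5
```

Training on `adv` in addition to `emb` is what makes the source model less sensitive to small input shifts, which is the intuition behind using adversarial training as a robustness tuning method.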

COLDQA
To study robust QA under distribution shifts, in this work we introduce COLDQA, a unified evaluation benchmark against text corruption, language change, and domain change. As shown in Table 2, we collect existing QA datasets to construct the source and target distributions for COLDQA.

(Table 3: Benchmarking results (%) on COLDQA for XLMR-base and XLMR-large. Each TTA method is run three times with random seeds and the average results are reported. Bold: the best results.)

Source Distribution The training data for the source distribution is SQuAD v1.1 (Rajpurkar et al., 2016). To evaluate model generalization on COLDQA, we first need to train a source model with the source training data. Next, we evaluate the model on each subset of each target dataset. For test-time adaptation, the model needs to be adapted with the test data on the fly. To evaluate performance under all kinds of distribution shifts, we use a multilingual pre-trained language model as the base model, since it maps different languages into a shared representation space.

Target Distributions We study the following target distribution shifts at test time.
• Text Corruption We use NoiseQA (Ravichander et al., 2021) to evaluate model robustness to text corruption. NoiseQA studies noise from real-world interfaces, i.e., speech recognizers, keyboards, and translation systems. When humans use these interfaces, the questions asked may contain noise, which degrades the QA system's performance. NoiseQA includes two subsets, NoiseQA-na and NoiseQA-syn. NoiseQA-na has real-world noise annotated by human annotators, while NoiseQA-syn is synthetically generated.
• Language Change A robust QA system should also perform well when the inputs are in other languages.We use the datasets XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020), designed for cross-lingual transfer, to evaluate change of language in the test data.
• Domain Change The test data may come from a domain different from the source domain used for model training. Here, the training and test domains are in the same language without any text corruption. We use the datasets from MRQA (Fisch et al., 2019) for evaluation, which include HotpotQA (Yang et al., 2018), NaturalQA (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), and TriviaQA (Joshi et al., 2017). The development sets of these datasets are used.

Comparison to Existing Benchmarks To the best of our knowledge, COLDQA is the first benchmark that unifies robustness evaluation over text corruption, language change, and domain change. Previous benchmarks for robust QA usually study only one type of distribution shift: e.g., NoiseQA (Ravichander et al., 2021), XTREME (Hu et al., 2020), and MRQA (Fisch et al., 2019) study text corruption, language change, and domain change respectively, and the methods proposed on these benchmarks are tested on only one type of distribution shift. It is thus unclear whether previously proposed methods generalize well to other types of distribution shifts. In contrast, COLDQA evaluates a method on all three types of distribution shifts, a more challenging task to tackle.

Experiments 6.1 Setup
To carry out a comprehensive evaluation on all types of distribution shifts, we use a multilingual pre-trained language model as the base model, specifically XLMR-base and XLMR-large (Conneau et al., 2020). To train the source model on SQuAD with vanilla fine-tuning, we use the default training setup from Hu et al. (2020). For robustness tuning, we use the hyper-parameter values suggested by Wang et al. (2021a) to train FreeLB and InfoBERT. For MVR and xTune, the default settings from the original work are used.
For test-time adaptation, the details of setting the hyper-parameter values for learning rate, batch size, α, γ, and K are given in Appendix A and shown in Table 10. All model parameters are updated during adaptation. Dropout is turned off for Tent and PL when generating the model outputs or pseudo-labels. The adaptation time of OIL on each dataset from COLDQA is shown in Tables 6, 7, and 8. All experiments were performed on one NVIDIA A100 GPU.

Main Results
Table 3 shows the benchmarking results of TTA, RT, and their combination on COLDQA. The detailed results on each subset of MRQA are reported in Table 4. We have the following observations.

COLDQA is challenging, and not all RT methods are effective on it.
In Fig. 7 from the appendix, we report the gains of RT baselines over vanilla fine-tuning on the development set of SQuAD. Not surprisingly, each RT baseline improves the model results on the in-distribution set. However, after re-benchmarking the RT baselines on COLDQA, we see that xTune and MVR are more effective than the adversarial training baselines. Among the adversarial training methods, only PGD and FreeLB improve the average results, and the improvements are marginal. Overall, COLDQA introduces new challenges to the existing RT methods.

OIL is stronger than PL and Tent. Tent is much less effective than OIL and PL on COLDQA, though it is a very strong baseline on CV tasks (Wang et al., 2021b). This shows the necessity of re-analyzing TTA methods on QA tasks. OIL is consistently better than PL based on the average results in Table 3, and mostly outperforms Tent and PL based on the detailed results on MRQA in Table 4.

TTA and RT are both effective, and they are comparable to each other.
On XLMR-base and XLMR-large, both TTA (OIL and PL) and RT (xTune and MVR) significantly improve the average results by around 1-3 absolute points. Overall, TTA and RT are comparable to each other. More specifically, on XLMR-large, the best TTA method, OIL, outperforms xTune on the average results; OIL is better than xTune on NoiseQA and MRQA, but lags behind xTune on XQuAD and MLQA. However, on XLMR-base, TTA does not outperform RT.

OIL is more robust than PL to variance in hyper-parameters and to test distributions that change over time. As Fig. 4 shows, we adapt XLMR-base tuned by xTune with various combinations of hyper-parameter values, where the learning rate is selected from {5e-6, 1e-6, 5e-7, 1e-7} and the memory size from {1, 3, 5}. In practice, the test distribution may also change over time, and the model needs to be adapted continually without stopping. In Fig. 5, we adapt the source model from the test language es to the language hi without stopping. On each test distribution, we report the relative gain over the source model without adaptation. We find that PL is less robust in this setting and often has negative gains, especially in the last few adaptations. In contrast, our proposed OIL achieves positive gains in nearly all adaptations, which demonstrates the robustness of OIL in continual adaptation.
Effects of Causal Inference.
Table 5 shows the effects of removing causal inference in OIL. Without causal inference, adaptation performance consistently drops on the test sets. In Fig. 6, we further show how β affects causal inference. β does affect the final results, and the optimal β value varies across datasets. To avoid tuning β, model bias is completely removed by setting β to 1 in all our experiments.

Conclusion
We study test-time adaptation (TTA) for robust question answering under distribution shifts. A unified evaluation benchmark, COLDQA, over text corruption, language change, and domain change is provided. A novel TTA method, OIL, is proposed, which achieves good performance when combined with a robustness tuning method.

A Appendix
A.1 Hyper-parameters

We provide the values of hyper-parameters for test-time adaptation.
(1) For the learning rate, we select a value smaller than the one used for training the source model; we set the learning rate to 1e-6. (2) For the batch size, we use 8 for smaller test sets and 16 for larger test sets. (3) For α used in updating the expert model, if the test set is large, we set α to a larger value such as 1; otherwise, we set α to a smaller value such as 0.99. (4) For γ used in filtering the noisy labels, γ = ∞ works well for most of the test sets, except for NoiseQA, XQuAD, and NaturalQA, where we set γ to 0.5. (5) For the memory size K, we set K to a smaller value for large test sets and a larger value for small test sets. The specific hyper-parameter values used for the TTA baselines are presented in Table 10.
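The heuristics above can be collected into an illustrative configuration helper; the size threshold and the dict layout are assumptions made for this sketch, and the exact per-dataset values are the ones in Table 10.

```python
# Illustrative summary of the stated hyper-parameter heuristics.
# The `large_threshold` cutoff is an assumption for the sketch;
# the paper selects values per dataset (see Table 10).

TTA_DEFAULTS = {
    "learning_rate": 1e-6,  # smaller than the source-training learning rate
    "beta": 1.0,            # fully removes model bias at inference
}

def suggest_config(test_set_size, large_threshold=5000):
    large = test_set_size >= large_threshold  # threshold is an assumption
    return {
        **TTA_DEFAULTS,
        "batch_size": 16 if large else 8,
        "alpha": 1.0 if large else 0.99,     # expert EMA coefficient
        "memory_size_K": 1 if large else 5,  # smaller K for large sets
    }

cfg_small = suggest_config(1000)
cfg_large = suggest_config(10000)
```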

B Effects of Denoising in OIL
Table 9 shows the effects of denoising in OIL. For NoiseQA and XQuAD, we set γ to 0.5 to filter out the noisy labels. When using XLMR-large as the base model for NoiseQA-na, the average performance drops substantially if noisy labels are not removed.
Table 10: Hyper-parameters for TTA baselines. For MRQA, we find that when PL and Tent keep the same learning rate and batch size as OIL, the final results are bad, so we choose better hyper-parameters for PL and Tent, as the table shows.

Figure 1 :
Figure 1: Illustration of test-time adaptation. π represents the model being adapted. x_t is the test data at time t and y_t is the returned predictions for x_t.

Figure 2 :
Figure 2: The average results of RT, TTA, and RT+TTA on COLDQA. RT+TTA significantly improves over RT and TTA.

Figure 3 :
Figure 3: (a) The proposed causal graph. (b) The calculation of the total direct effect as in Eq. 7.

Figure 4 :
Figure 4: Robustness of PL and OIL to variance of hyper-parameter values. With OIL and PL, we adapt XLMR-base tuned by xTune using various combinations of hyper-parameter values, namely learning rate and memory size. The learning rate is selected from {5e-6, 1e-6, 5e-7, 1e-7} and the memory size from {1, 3, 5}. For OIL, different values of α are tested.

Figure 6 :
Figure 6: The effects of β used in causal inference. Results are evaluated on NoiseQA with XLMR-base tuned by xTune.

Figure 7 :
Figure 7: Gains over the results of XLMR-large with vanilla fine-tuning on the development set of SQuAD. abs.: absolute.

Table 1 :
Compared settings. X_s and Y_s are drawn from a source distribution and X_t from a target distribution. x_s ∈ X_s, y_s ∈ Y_s, x_t ∈ X_t.

Table 8 :
Adaptation time (in seconds) on each subset of XQuAD and NoiseQA.