Test-Time Self-Adaptive Small Language Models for Question Answering

Recent instruction-finetuned large language models (LMs) have achieved notable performance on various tasks, such as question answering (QA). However, despite their ability to memorize a vast amount of general knowledge across diverse tasks, they might be suboptimal on specific tasks due to their limited capacity to transfer and adapt knowledge to target tasks. Moreover, further finetuning LMs with labeled datasets is often infeasible due to their absence, and it is also questionable whether smaller LMs, which hold only limited knowledge, can be transferred with unlabeled test data alone. In this work, we show and investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data. In particular, we first stochastically generate multiple answers, and then ensemble them while filtering out low-quality samples to mitigate noise from inaccurate labels. Our proposed self-adaptation strategy demonstrates significant performance improvements on benchmark QA datasets with higher robustness across diverse prompts, enabling LMs to produce stable outputs. Code is available at: https://github.com/starsuzi/T-SAS.


Introduction
Language models (LMs) have gained the ability to learn generalizable representations that are applicable to diverse tasks by being trained on massive text corpora with an increased number of parameters (Brown et al., 2020; Kojima et al., 2022). Moreover, to enhance transferability to unseen tasks, LMs are further fine-tuned on instructions verbalized from a vast amount of supervised datasets, showing remarkable zero-shot ability across a wide range of tasks (Wei et al., 2022a; Sanh et al., 2022).
However, despite their ability to store a vast amount of general knowledge across diverse tasks, LMs show suboptimal performance on specific downstream tasks when transferring and adapting their knowledge to target tasks. One possible solution is to additionally fine-tune LMs, but this is often impractical in realistic scenarios where labeled datasets are scarce. Furthermore, while large LMs with hundreds of billions of parameters may solve specific target tasks without fine-tuning, they are rarely accessible. Motivated by these challenges, we focus on investigating the self-adaptive capabilities of smaller LMs at test time.
While several studies have shed light on the potential for self-improvement of LMs, they mainly focus on augmenting supervised labeled data with additional self-generated labels (He et al., 2020; Chen et al., 2023; Wang et al., 2023b), which significantly differs from our more challenging setup that does not rely on labeled data at all. Note that a recent work (Huang et al., 2022) has shown the self-adaptive ability of large LMs with 540B parameters by further training them with their generated answers, focusing on reasoning tasks. However, the utilization of large LMs incurs substantial costs and restricted accessibility. Therefore, our attention shifts towards smaller LMs, whose potential to be adapted to target tasks, given their limited capability of storing knowledge, remains largely unexplored.
Note that the adaptation of smaller LMs to downstream tasks brings forth new and critical challenges. First, while it seems evident that large LMs with extensive knowledge are adept at facilitating self-adaptation, the question arises whether smaller LMs are capable of self-adaptation, specifically when supervised labeled datasets are absent. Second, it may be suboptimal to use all the self-generated labels for training, as some of them could contain incorrect information, leading to significant performance degradation (Zhou et al., 2023). Third, unlike large LMs, adapting smaller LMs to specific tasks would require the incorporation of external knowledge, due to their limited capacity for storing specific knowledge.
In this work, we aim to address the challenges of self-adaptation in smaller LMs without additional labeled datasets, focusing on QA tasks augmented with external knowledge. To this end, we first stochastically generate multiple answers for a given unlabeled question with its related document. Then the most plausible answer is selected through majority voting, serving as a pseudo-label for training at test time. Note, however, that we should be aware of the possibility of unreliable results from the self-ensemble.
To mitigate this, we further propose to filter out potentially incorrect samples based on low agreement among self-generated labels, as illustrated in Figure 1. We refer to our proposed method as Test-time Self-Adaptive Small LMs (T-SAS). We validate T-SAS on QA datasets with smaller LMs, where it significantly improves their self-adaptive capabilities.
Our contributions and findings are threefold:
• We first explore the self-adaptive capability of smaller LMs in a realistic setting with only unlabeled data at test time.
• We ensure that the high quality of the self-generated labels is maintained by proposing a novel self-ensemble scheme with filtering.
• We show that our T-SAS method achieves outstanding performance on QA tasks.

Related Work
Language Models Pre-trained language models have shown decent advancements in diverse tasks (Brown et al., 2020) and have been further improved by increasing the number of parameters to billions (Touvron et al., 2023; Anil et al., 2023) and leveraging instruction-finetuning techniques with a vast amount of supervised labeled datasets (Wei et al., 2022a; Sanh et al., 2022). Despite their recent successes (Wei et al., 2022b; Wang et al., 2023a), however, we see that LMs still encounter difficulties in effectively addressing downstream tasks due to their limited task-specific adaptation.
Self-adaptive LMs Due to the frequent occurrence of domain or data shift in real-world scenarios, self-adaptive models have gained substantial attention (Wang et al., 2021a; Shu et al., 2022; Veksler, 2023). In particular, some work has suggested augmenting supervised labeled training data with self-generated labels (He et al., 2020; Chen et al., 2023; Wang et al., 2023b). In contrast, we assume a more realistic test-time setup without labeled data. Note that there are recent studies on self-consistent LMs. Specifically, Huang et al. (2022) demonstrated that large LMs can be further trained with the most consistent labels among multiple self-generated labels. On the other hand, our focus lies on more practical and accessible smaller LMs with novel strategies of answer sampling and filtering. While some research focuses on self-consistent prompts by regularizing LMs to generate consistent outputs across different prompts (Zhou et al., 2022; Zeng and Gao, 2023; Wan et al., 2023), that focus is different and orthogonal to ours, which proposes to self-adapt smaller LMs for specific target tasks associated with external knowledge.
Self-adaptation for Extractive QA Several previous studies explored the self-adaptive capabilities of traditional pre-trained language models, but they mainly addressed classification problems under the extractive setting (Li et al., 2020; Shakeri et al., 2020; Banerjee et al., 2021; Wang et al., 2021b; Ye et al., 2022). In contrast, we focus on the generative setting, which makes a large difference due to fundamentally different objectives. To be specific, in the extractive setting, self-adaptation is based on span probabilities, while in the generative setting it is done using generated text.
In situations where filtering is further applied, filtering is likewise based on probabilities in the extractive setting, whereas it is done using the generated text in the generative setting. Furthermore, Li et al. (2020) and Shakeri et al. (2020) assume an unsupervised QA setting, where context-question-answer triplets are not available, thus requiring an additional query-generation module. Such a pair-generation approach is different and orthogonal to ours, since we aim to enhance answer generation directly from the provided context and question.

Preliminaries
Question Answering Let D_train = {(q_i, d_i, a_i*)} be a labeled QA training set, where each instance consists of a question q_i, a gold answer a_i*, and its associated document d_i that contains a_i*. Similarly, an unlabeled test set is defined as D_test = {(q_i, d_i)}. Assume that LM is a language model parameterized with θ, which has been instruction-finetuned on massive datasets. Then, given a pair of a question and its relevant document (q_i, d_i), LM generates an answer as follows: ā_i = LM(d_i, q_i; θ). Note that, in order to generate a correct answer, i.e., a_i* = ā_i, it is beneficial to train LM on the labeled training set by minimizing a loss (e.g., cross-entropy) between the correct answer a_i* and the model prediction ā_i: L(a_i*, ā_i). Then, after the training phase, LM is more likely to correctly predict a_i* on the unlabeled test set D_test of the target task.
Self-Adaptive LM While a recent LM is capable of answering questions, directly applying it to the target task may yield suboptimal results, thus requiring transfer learning or adaptation. To do so, our idea is to maximally leverage the unlabeled test set D_test that we have in hand for the target task. In particular, given D_test, a possible solution is to train LM with its self-generated pseudo-label ā_i*. In other words, LM can be further trained on D_test_self = {(q_i, d_i, ā_i*)}, where ā_i* is generated from the unlabeled test sample (q_i, d_i) with LM, for self-adaptation to target tasks.
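The construction of D_test_self described above can be sketched as a short pseudo-labeling loop; `lm_generate` is a hypothetical stand-in for the actual seq2seq model call ā_i = LM(d_i, q_i; θ), not the paper's implementation:

```python
def build_self_labeled_set(test_set, lm_generate):
    """Construct D_test_self = {(q_i, d_i, ā_i*)} from the unlabeled
    test set D_test = {(q_i, d_i)} using the LM's own predictions."""
    return [(q, d, lm_generate(d, q)) for q, d in test_set]

# Toy stand-in for ā_i = LM(d_i, q_i; θ): pretend the answer is
# the last word of the document (for illustration only).
def lm_generate(doc, question):
    return doc.split()[-1]

d_test = [("Who wrote it?", "It was written by Perrett")]
d_test_self = build_self_labeled_set(d_test, lm_generate)
print(d_test_self)
# [('Who wrote it?', 'It was written by Perrett', 'Perrett')]
```

The LM would then be fine-tuned on these triples exactly as it would on a labeled set, substituting ā_i* for a_i* in the loss L.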

Test-time Self-Adaptive Small LMs
We describe our test-time self-adaptive LMs (T-SAS) with the proposed strategies for effectively generating and utilizing pseudo-labels at test time.
Stochastic Self-Ensemble Note that relying on a single ā_i* generated by LM, which is trained on general domains, may result in inaccurate predictions when adapted to the target task, as there is a possibility of an incorrectly self-generated ā_i*. To mitigate this, we propose to make LM generate multiple answers {ā_i,j}, j = 1, ..., n, from diverse points of view. Note that, while existing work (Huang et al., 2022; Wang et al., 2023a) proposed to use Top-k or nucleus sampling (Fan et al., 2018; Holtzman et al., 2020) when generating {ā_i,j} for reasoning tasks, their diversity might be limited due to answer sampling based on a single representation. Instead, we propose to leverage multiple representations generated through Monte-Carlo (MC) dropout (Gal and Ghahramani, 2016) by randomly masking weights of LM at test time, as follows: ā_i,j = LM(d_i, q_i; θ ⊙ M_j), with M_j ~ M, where M is a distribution of mask weights and M_j is a sampled mask weight. Once we have generated multiple answers {ā_i,j}, our next objective is to assign one pseudo-label ā_i* from the set using a majority voting strategy, which selects the ā_i* with the highest number of occurrences among {ā_i,j}. After acquiring the self-generated label ā_i* for the unlabeled D_test, we then train LM on D_test_self with the following loss term: L(ā_i*, ā_i).
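The majority-voting step of the self-ensemble can be sketched as below; the candidate answers stand in for n stochastic decoding passes under different MC-dropout masks, and the light normalization is an illustrative assumption (the paper does not specify how answer strings are matched for voting):

```python
from collections import Counter

def majority_vote(answers):
    """Select the pseudo-label ā_i*: the most frequent answer among
    the n stochastically generated candidates {ā_i,j}."""
    counts = Counter(a.strip().lower() for a in answers)  # light normalization
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical candidates from n = 5 MC-dropout decoding passes:
candidates = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
print(majority_vote(candidates))  # -> "paris" (3 of 5 votes)
```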
Filtering However, in contrast to Huang et al. (2022) and Wang et al. (2023a), who use all samples and their associated pseudo-labels ā_i* for training, relying entirely on them can be largely problematic, since LM lacks specific training to make valuable predictions on particular tasks. Therefore, motivated by the fact that the substantial performance improvements of LMs are primarily attributed to their training on high-quality data (Zhou et al., 2023), we further propose an automatic filtering strategy to identify and exclude samples {(q_i, d_i, ā_i*)} ⊂ D_test_self that are likely to have an incorrect ā_i*. Concretely, we remove samples whose label ā_i* has a vote count proportionally lower than a certain threshold.
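The agreement-based filter can be sketched as follows: a sample is kept only if the winning answer's vote share clears a threshold (the Appendix reports a threshold of 0.7 in the paper's experiments); the function name and normalization are illustrative:

```python
from collections import Counter

def filter_by_agreement(answers, threshold=0.7):
    """Return the majority pseudo-label if its vote share is at least
    `threshold`; otherwise return None, i.e., drop the sample from
    D_test_self as likely mislabeled."""
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    if votes / len(answers) >= threshold:
        return label
    return None

# High agreement (12/15 = 0.8): kept.
print(filter_by_agreement(["1961"] * 12 + ["1962"] * 3))   # -> "1961"
# Low agreement (8/15 ≈ 0.53): dropped.
print(filter_by_agreement(["1961"] * 7 + ["1962"] * 8))    # -> None
```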

Experimental Setups
In this subsection, we describe the experimental setups. Further details are shown in Appendix A.

Baselines and Our Model
We compare our T-SAS against other baselines using unlabeled test data. We use FLAN (Chung et al., 2022) and the same prompt across all models.
1) Finetuned w/ Training Set is an indicator model, finetuned on the labeled training set.
2) Naïve LM w/o Ext. is a naïve baseline without external knowledge and self-adaptive training.
3) Naïve LM is a baseline without self-adaptive training, but incorporating external knowledge.
4) Self-Adaptive w/ Greedy is trained on self-generated pseudo-labels obtained via greedy decoding.
5) Self-Adaptive w/ Soft is trained on the result of soft voting.
6) Self-Adaptive w/ LMSI is trained on the result of majority voting, using Top-k sampling (Huang et al., 2022).
7) T-SAS (Ours) incorporates both our self-ensembling and filtering strategies.

Results
Here, we show the overall performance of T-SAS.
Please see Appendix B for more results.

Main Results
As Table 1 shows, T-SAS significantly outperforms all baselines across varying model sizes, particularly with the FLAN-Large model. Note that Naïve LMs show substantially lower performance than the LMs finetuned with supervision. This corroborates our hypothesis that LMs are not fully optimized for target tasks. Also, by integrating external knowledge, the performance of all models, especially smaller ones, is largely improved. Furthermore, ensembling multiple self-generated predictions significantly enhances performance compared to baselines with greedy decoding. This indicates that T-SAS, by considering diverse points of view, reduces the likelihood of performance-degrading scenarios caused by relying on a single prediction.
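For reference, the Exact Match (EM) scores reported in Table 1 are conventionally computed as string equality after light answer normalization; a minimal sketch assuming standard SQuAD-style normalization (the paper does not detail its exact evaluation script):

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop English
    articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # -> 1.0
print(exact_match("Lyon", "Paris"))                     # -> 0.0
```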
However, when compared to the baselines with multiple self-generated predictions (i.e., Self-Adaptive w/ Soft and w/ LMSI), the results indicate that a filtering strategy is required. Solely relying on the final result of the self-ensemble should be avoided with smaller LMs, in contrast to the observation by Huang et al. (2022) with large LMs (540B) on reasoning tasks. Interestingly, the model trained on soft labels, which necessarily leverages all the predictions, shows even lower performance than the Naïve LMs, emphasizing the negative impact of training on inaccurate self-generated labels.
Comparing the performance with large LMs is outside our scope, since our work targets investigating the effectiveness of smaller LMs on self-adaptation. Note that large LMs and smaller LMs are not directly comparable to each other, due to their discrepancy in capacity. However, we additionally report the performance of the large zero-shot LM as an indicator in Figure 2. Surprisingly, our T-SAS significantly outperforms a much larger zero-shot LM, which further signifies the effectiveness of the proposed self-adaptive strategies for smaller LMs.

Robustness on Diverse Prompts Recent LMs have been observed to be prompt-sensitive (Cho et al., 2023; Ishibashi et al., 2023), which is undesirable, as consistent outputs are expected across similar prompts, particularly in real-world scenarios with diverse users. As shown in Figure 3, T-SAS shows substantial robustness to diverse prompts, which is attributed to the proposed stochastic generation and filtering strategies. In other words, T-SAS effectively mitigates the impact of inaccurate predictions by filtering them from multiple perspectives, even for specific prompts.
Effectiveness on Specific Domains In addition to evaluating T-SAS in general domains, we further conduct experiments on specific domains, including the science domain, SciQ (Welbl et al., 2017), and the clinical domain, cpgQA (Mahbub et al., 2023), to assess its domain adaptability. Table 2 shows consistent improvements from T-SAS, highlighting its ability to enhance transferability across domain shifts. These findings indicate the applicability of T-SAS in domains where high-quality labeled data is scarce.
Effectiveness on Unseen Datasets Recent instruction-finetuned language models have been extensively trained on various QA datasets, making it challenging to evaluate them on truly new datasets. Therefore, in Table 2, we further show clear improvements even on the unseen datasets, cpgQA and TyDiQA (Clark et al., 2020). Also, it is worth noting that the performance improvement achieved by applying self-adaptation to already-seen data is not a weakness but rather a strength. To be more specific, even though an LM has been trained comprehensively on diverse datasets including the target dataset, it remains a general-purpose model that is not tailored to the specific target dataset. However, by using our proposed T-SAS on the target dataset, the model can further achieve improved performance thanks to its self-adaptation capabilities over the target dataset.

Stochastic Generation Strategy with Filtering
We compare MC dropout and Top-k sampling with varying filtering thresholds. As shown in Figure 4, both strategies benefit from a filtering strategy that mitigates noise introduced during the stochastic generation process. Moreover, MC dropout consistently outperforms Top-k sampling, which can be attributed to its higher lexical diversity (0.24 vs. 0.14). These findings suggest that diversity, combined with a filtering strategy, allows LMs to effectively identify and remove low-quality outputs with higher variance, consequently resulting in overall performance improvement.
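A lexical diversity comparison like the one above (0.24 vs. 0.14) can be illustrated with a distinct-unigram ratio over the sampled answers; this is an assumed metric for illustration, as the paper does not state its exact diversity formula:

```python
def distinct_unigram_ratio(answers):
    """Ratio of unique tokens to total tokens across all sampled
    answers: higher means more lexically diverse generations."""
    tokens = [tok for ans in answers for tok in ans.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical samples for the same question from two decoders:
diverse = ["elvis presley", "the king", "presley", "elvis"]  # 4 unique / 6 tokens
narrow = ["elvis", "elvis", "elvis", "elvis presley"]        # 2 unique / 5 tokens
print(distinct_unigram_ratio(diverse))  # ≈ 0.67
print(distinct_unigram_ratio(narrow))   # 0.4
```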

Ablation Studies
In order to see how each of the stochastic generation and filtering strategies contributes to the overall performance, we provide ablation studies with two variants of T-SAS at the Large and XL sizes. To be specific, in T-SAS w/o Stochastic, low-quality samples are filtered out based on the generation probability of a single prediction, and in T-SAS w/o Filtering, the majority voting results are directly used as pseudo-labels without applying our proposed filtering strategy. As shown in Table 3, both of the proposed strategies positively contribute to the overall performance, substantially improving over Naïve LMs at both model sizes. Furthermore, the results indicate that the two strategies are complementary, working together to enhance overall performance.

Conclusion
In this work, we investigated and improved the self-adaptive capabilities of recent smaller LMs on the QA task, using only unlabeled test data. Specifically, our proposed method involves self-ensembling stochastically generated labels and a filtering strategy to remove possibly incorrect labels, thereby enabling training with high-quality self-generated labels. The experimental results and analyses indicate that our method significantly improves QA performance at test time.

Limitations
While we show clear advantages of using T-SAS to address the realistic challenges of recent LMs regarding their self-adaptive capabilities, it is important to acknowledge that our validation was conducted under the assumption of having gold external documents containing the answers. However, in real-world scenarios, obtaining such gold documents may not be feasible, necessitating the incorporation of additional retrieval modules to retrieve query-relevant information. Although the integration of retrieval modules is beyond the scope of our current work, our exploration of the potential benefits of an external knowledge-augmented setting for self-adaptive smaller LMs opens up fruitful avenues for future research. We also believe that investigating the integration of retrieval modules with T-SAS to further enhance the practical applicability of self-adaptive LMs in real-world applications holds significant value.

A Experimental Setups
Datasets We use three QA datasets, augmented with external documents from Wikipedia, preprocessed by Karpukhin et al. (2020). 1) Natural Questions (NQ) (Kwiatkowski et al., 2019) is composed of questions from the Google Search engine.
Implementation Details For a fair comparison, we compare the models using the instruction-finetuned FLAN-T5 (Chung et al., 2022) model with three different sizes, Base (250M), Large (780M), and XL (3B), with the same prompt: 'Read this and answer the question{context}{question}'. For the models larger than 3B, we trained them adopting the low-rank adaptation (LoRA) method (Hu et al., 2022). For hyperparameters, we set the number of training epochs to 5 for all the self-adaptive models and 1 for the indicator model. Also, we set the number of stochastically generated predictions to 15 and the filtering threshold to 0.7.

B Additional Experimental Results
Dropout Mask Variations To investigate the impact of the number of masks in MC dropout on performance, we conduct experiments with varying numbers of masks. Figure 5 illustrates that increasing the number of masks leads to improved performance, which stabilizes after reaching a certain number of masks. Note that the stochastically perturbed models, with multiple dropout masks, offer more diverse perspectives compared to the ablated model without stochastic generation or the model with only one dropout mask, thus decreasing the possibility of selecting inaccurate self-generated answers.
Effectiveness on Diverse Model Sizes In addition to the performance on the three sizes shown in Table 1, we additionally conduct experiments on the FLAN-Small (80M) and FLAN-XXL (11B) models. The results in Table 4 demonstrate that our T-SAS consistently enhances performance on much smaller and larger model sizes.
Effectiveness on T0-3B In addition to the FLAN series, we conduct experiments on another LM, T0 (Sanh et al., 2022), at the 3B size. As shown in Table 5, our T-SAS consistently improves the performance on the three QA datasets, which indicates the applicability of T-SAS to diverse LMs.

Effectiveness of Augmented External Document
We have observed significant performance improvement with the augmented external documents in all models, as shown in Table 1. Here, we further analyze the importance of augmenting external documents, especially for the models that require training. As shown in Table 6, the overall performance without external knowledge is severely degraded for all models. Interestingly, the performance degradation for the indicator model trained on the supervised training dataset is remarkable. These findings corroborate our claim that external knowledge is necessarily required for training self-adaptive and especially smaller LMs, whose capacity for storing specific knowledge is highly limited.
Case Study We conduct a case study, mainly comparing our T-SAS against a self-adaptive baseline model with LMSI, in Table 7. The first example shows the robustness of our T-SAS approach in addressing prompt-sensitive situations, where an LM exhibits significantly degraded performance for a specific prompt, 'Read the following article and answer the question. Article: {} Question: {}'.

Figure 1: Illustration of our proposed T-SAS that includes self-ensemble and filtering strategies for self-adapting LMs.

Figure 5: F1 scores with varying numbers of dropout masks.

Table 1: Exact Match (EM) and F1 scores on three QA benchmark datasets with varying sizes of FLAN.

Table 4: Results of FLAN-Small and FLAN-XXL, on NQ.