Leveraging Multiple Teachers for Test-Time Adaptation of Language-Guided Classifiers

Recent approaches have explored language-guided classifiers capable of classifying examples from novel tasks when provided with task-specific natural language explanations, instructions, or prompts (Sanh et al., 2022; R. Menon et al., 2022). While these classifiers can generalize in zero-shot settings, their task performance often varies substantially between different language explanations in unpredictable ways (Lu et al., 2022; Gonen et al., 2022). Current approaches also fail to leverage unlabeled examples that may be available in many scenarios. Here, we introduce TALC, a framework that uses data programming to adapt a language-guided classifier to a new task during inference, when provided with explanations from multiple teachers and unlabeled test examples. Our results show that TALC consistently outperforms a competitive baseline from prior work by 9.3% (relative improvement). Further, we demonstrate the robustness of TALC to variations in the quality and quantity of provided explanations, highlighting its potential in scenarios where learning from multiple teachers or a crowd is involved. Our code is available at: https://github.com/WeiKangda/TALC.git.


Introduction
Inductive learning from examples has underpinned many successful machine learning applications. However, classifiers trained solely from labeled examples often struggle to generalize in scenarios with limited labeled data. In contrast, humans can learn new concepts through natural language conversations (Chopra et al., 2019; Tomasello, 1999). Inspired by this phenomenon, recent approaches use natural language explanations, instructions, and prompts to train language-guided classifiers (Srivastava et al., 2017; Andreas et al., 2018; Murty et al., 2020; Wang* et al., 2020; Ye et al., 2020). While these classifiers can perform zero-shot classification, they have several limitations. Firstly, they lack a principled strategy for weighing language supervision from multiple sources (or teachers). Secondly, they fail to utilize the available unlabeled data for a new task during inference. Additionally, the impact of the quality of sources, and the inclusion of low-quality explanations, remains largely unexplored.

Figure 1: TALC leverages data programming to perform test-time adaptation of a language-guided classifier. Natural language explanations (E = {e_1, e_2, e_3}) provided by multiple teachers and unlabeled examples (X_1:n) for a new task are fed to a language-guided classifier pair-wise, resulting in multiple pseudo-labels for the unlabeled examples. TALC uses a graphical aggregation to weigh the pseudo-labels from different explanations to decide the final predicted label (Ŷ). TALC is highly flexible in aggregating labels as it can conceptually consider a broad variety of factors, such as the complexity of explanations, consistency between explanation predictions, and the identity of the explanation provider.
To address these limitations, we present TALC (Test-time Adaptation of Language-guided Classifiers), a framework for adapting language-guided classifiers to novel tasks during inference, also known as test-time adaptation. TALC assumes a priori access to the entire test set (unlabeled samples) of the novel task and the task-specific explanations, which aligns with real-world situations such as developing a product-category classifier for an e-commerce platform. In the context of TALC, the multiple explanations available for each task are treated as distinct supervisory signals.
Leveraging data programming (Ratner et al., 2018b), TALC effectively aggregates the supervision provided by these multiple explanations.
Figure 1 illustrates the TALC framework. TALC uses a subset of the test data, called the adaptation set, for adapting the language-guided classifier. For each pair of explanation and test example in the adaptation set, a pseudo-label is generated using the base language-guided classifier. TALC learns a label aggregator on the pseudo-labels generated for the adaptation set using EM (§3). The label aggregator is trained to consider the contribution of each explanation during adaptation, thus, in principle, allowing it to weigh different sources of language supervision. Finally, TALC applies the learned aggregator over the entire test set to obtain final predictions for the test set examples.
We evaluate TALC on six classification tasks from the CLUES-Real dataset (R. Menon et al., 2022), where each task is paired with natural language explanations (§4). TALC outperforms strong baselines by 3.3% on average (absolute). Through qualitative and quantitative analysis, we investigate TALC's robustness with respect to the size of the adaptation set, the number of explanations, and explanation quality. In the subsequent sections, we describe TALC in detail (§3), present experimental results and analysis (§4), and conclude by discussing our contributions, limitations, and ethical considerations. Our contributions are:
• We introduce TALC, a test-time adaptation framework that uses label aggregation to improve language-guided classifiers.
• We demonstrate the effectiveness of TALC on multiple real-world classification tasks from CLUES-Real (R. Menon et al., 2022).
• We present comprehensive analyses to evaluate the robustness of TALC w.r.t. the quantity and quality of explanations.

Related Work
Learning From Language Using natural language explanations to inform or train classifiers has garnered significant interest in recent years (Goldwasser and Roth, 2014; Srivastava et al., 2017; Hancock et al., 2018; Murty et al., 2020). While Murty et al. (2020) enhance supervised BERT models (Devlin et al., 2019) for relation extraction tasks, other approaches employ language explanations for few-shot learning. For instance, Hancock et al. (2018) convert explanations to labeling functions via semantic parsing, leveraging unlabeled data for weak labels. More recently, R. Menon et al. (2022) utilize natural language explanations in an entailment-based model for classification decisions.
Test-time Adaptation Test-time adaptation has been extensively studied in computer vision, employing batch-normalization statistics (Nado et al., 2020; Khurana et al., 2021; Schneider et al., 2020), test-time entropy minimization (Wang et al., 2020; Sivaprasad and Fleuret, 2021), prediction consistency maximization over augmentations (Zhang et al., 2021), and classifier adjustment (Iwasawa and Matsuo, 2021). In natural language processing, Banerjee et al. (2021) explore test-time adaptation for question answering using self-supervision. In contrast, we introduce a new test-time adaptation approach that leverages data programming to adapt a base language-guided classifier during inference by utilizing natural language explanations.
Data Programming Data programming (Ratner et al., 2017) employs a combination of multiple labeling functions and generative models to create probabilistic training labels for unlabeled datasets.
Prior work (Ratner et al., 2018b; Hancock et al., 2018) has demonstrated successful applications of this paradigm in systems that allow users to label large datasets programmatically. Here, we repurpose data programming in the test-time adaptation setting to improve classifiers on unseen tasks.

TALC
In this section, we present the details of our framework, TALC. TALC leverages data programming to adapt a base natural language explanation-guided classifier to a novel task during inference.
Problem Setup. We assume a language-guided classifier, M_LC, which takes an explanation e from a teacher and an example X to predict a label M_LC(X, e). A language-guided classifier is a classifier that utilizes one or more natural language explanations to make predictions. Our objective is to make predictions for a batch of test samples, represented as {X_test, Y_test}_1:n, where Y_test denotes the unobserved ground-truth labels corresponding to the samples in X_test, and n denotes the number of samples. During test-time adaptation, our aim is to effectively adapt the classifier to the specific task at hand and infer the true labels for X_test. Existing methods for test-time adaptation typically assume an online setting, where examples are processed one at a time (Sun et al., 2020; Banerjee et al., 2021). In contrast, we assume a priori access to the entire test set of the task. This assumption allows us to leverage the empirical distribution of the unlabeled data for semi-supervised learning.
Our setting aligns with real-world scenarios, such as developing a product-category classifier for an e-commerce platform, where the complete database of products (including the test set) is known in advance. For situations where test samples are observed one at a time, it is still possible to use TALC to adapt a base classifier: this involves a "warm-up" phase, where the base classifier is used off-the-shelf for a few samples, followed by adaptation using TALC. While this usage scenario is not the primary focus of our work, for brevity we defer a description of how TALC can be employed in such cases to Appendix §A.
Overview. As depicted in Figure 1, for a new task T_new, we are provided with m natural language explanations E = {e_1, e_2, ..., e_m} and a set of examples {X_i ∈ X_test}. To generate the classifier outputs, we iterate through each explanation e_j for every example X_i and compute M_ij := M_LC(X_i, e_j). This yields a labeling matrix M of shape n × m. Next, we introduce a test-time adaptation procedure, TALC, to compute the final labels Ỹ from M. This procedure essentially implements a function f : M ∈ R^{n×m} → Ỹ ∈ R^n, which we describe in the rest of this section.
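The construction of the labeling matrix M can be sketched as follows. Here `classify` is a hypothetical stand-in for the base language-guided classifier M_LC (ExEnt in our experiments), and the keyword-overlap "classifier" in the usage example is purely illustrative:

```python
def build_labeling_matrix(examples, explanations, classify):
    """Return the n x m matrix M with M[i][j] = M_LC(X_i, e_j)."""
    return [[classify(x, e) for e in explanations] for x in examples]

# Purely illustrative usage with a dummy keyword-overlap "classifier".
examples = ["red round fruit", "yellow long fruit"]
explanations = ["if red then apple", "if yellow then banana"]

def dummy_classify(x, e):
    # Label 1 if any word of the explanation occurs in the example.
    return 1 if any(w in x.split() for w in e.split()) else 0

M = build_labeling_matrix(examples, explanations, dummy_classify)
```

Each row of M collects the pseudo-labels that the m explanations induce for one unlabeled example; TALC's aggregation then operates purely on this matrix.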
Test-time Adaptation. The objective of TALC is to adapt the language-guided classifier, M_LC, to a novel task T_new during inference. We illustrate the adaptation procedure in Algorithm 1. First, we split the test set into two disjoint sets: the adaptation set and the held-out set. We also partition the labeling matrix M into M_adapt and M_held-out by choosing the rows corresponding to samples in the adaptation set and held-out set, respectively.
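The split of the labeling matrix might look like the following sketch, where `alpha` is the adaptation ratio used throughout the paper (a random permutation of the rows could precede the split; the function name is ours):

```python
def split_for_adaptation(M, alpha):
    """Partition labeling-matrix rows into (M_adapt, M_held_out),
    where alpha is the adaptation ratio: the fraction of test rows
    used to fit the label aggregator."""
    k = int(len(M) * alpha)
    return M[:k], M[k:]
```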
To model the dependence between the (latent) inferred labels and M_adapt, we use data programming techniques (Ratner et al., 2019) to train a label aggregator, L_agg, with task-specific parameters w. We use the learned parameters (which correspond to weights learned for each explanation) to aggregate predictions in M (both M_adapt and M_held-out) and infer the labels, Ỹ_TALC.
Label Aggregator. The label aggregator is a graphical model that defines the joint probability of explanations E, examples X, and latent (true) labels Y for a given language-guided classifier as:

P_w(X, E, Y) ∝ exp(wᵀ ϕ(X, E, Y, M_LC))    (1)

Here, ϕ is a vector of features that can be computed from X, E, Y, and M_LC, and w is a weight vector corresponding to each of those features. In general, this can subsume a very broad range of features, including features that indicate the complexity of an explanation or its provenance. We also note that, since the labeling matrix M is computed from X, E, and M_LC, ϕ can include features that depend on M and Y. For simplicity, our instantiation incorporates the labeling rates of each explanation (how frequently an explanation assigns a label) and the correlations between the pseudo-labels from different explanations to estimate the accuracy of each individual explanation in an unsupervised manner. Specifically, the label aggregator is defined in terms of two types of features: accuracy (ϕ_Acc ∈ R^{n×m}) and propensity (ϕ_Prop ∈ R^{n×m}). Each value in ϕ_Acc and ϕ_Prop is defined as:

ϕ_Acc[i, j] = 1{M_ij = y_i},    ϕ_Prop[i, j] = 1{M_ij ≠ y_abstain}

where y_i is the label for the i-th sample and y_abstain is a special label denoting that M_LC has abstained from predicting a label based on the j-th explanation. The accuracy factor fires if the inferred label Y_i for an unlabeled example X_i matches the predicted label from explanation j. The propensity factor fires whenever the classifier does not abstain from predicting a label from an explanation. Since we only define two types of features for each explanation, w ∈ R^{2m} is a learnable vector of weights for the accuracy and propensity factors of each explanation. The weights are learned by maximizing the log-likelihood log P_w(X, E) = log Σ_Y P_w(X, E, Y) using the expectation-maximization (EM) algorithm (since we do not have ground-truth labels Y at test time). We compute the MAP estimate Ỹ_TALC := argmax_Y P_ŵ(Y | X, E) using Gibbs sampling to predict the final labels. Note that while we learn the weights ŵ on the adaptation set (line 4 in Algorithm 1), the learned weights are used to aggregate predictions over both the adaptation and held-out examples to predict the labels Ỹ_TALC (line 5 in Algorithm 1). We implement the label aggregator using Snorkel-MeTaL (Ratner et al., 2018a). Appendix §C provides task-specific details of the label aggregator training.
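To convey the idea behind the learned per-explanation weights, the following is a toy one-coin EM aggregator. It is a deliberately simplified stand-in for the Snorkel-MeTaL model actually used (no propensity factor, no Gibbs sampling, no abstentions), and all names are illustrative:

```python
def em_label_aggregator(M, n_labels=2, n_iter=25):
    """Toy one-coin EM label aggregator: alternates between inferring
    a posterior over each example's latent label (E-step) and
    re-estimating each explanation's accuracy (M-step).
    Labels in M must lie in range(n_labels); abstains are not handled."""
    n, m = len(M), len(M[0])
    acc = [0.7] * m  # initial accuracy guess per explanation
    post = []
    for _ in range(n_iter):
        # E-step: posterior over labels for each example.
        post = []
        for row in M:
            scores = []
            for y in range(n_labels):
                p = 1.0
                for j, lab in enumerate(row):
                    p *= acc[j] if lab == y else (1 - acc[j]) / (n_labels - 1)
                scores.append(p)
            z = sum(scores) or 1.0
            post.append([s / z for s in scores])
        # M-step: accuracy = expected fraction of agreements with the
        # inferred label, clipped away from 0 and 1 for stability.
        for j in range(m):
            match = sum(post[i][M[i][j]] for i in range(n)) / n
            acc[j] = min(max(match, 1e-3), 1 - 1e-3)
    labels = [max(range(n_labels), key=lambda y: p[y]) for p in post]
    return labels, acc
```

With two reliable explanations and one unreliable one, the learned accuracies separate accordingly and the aggregated labels follow the reliable explanations, which is the behavior TALC relies on.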

Experiment and Analysis
In this section, we evaluate the zero-shot adaptation performance of TALC on classification tasks, followed by a detailed analysis of TALC's robustness.

Data
We assess the performance of TALC on real-world classification tasks from the CLUES (R. Menon et al., 2022) benchmark. Out of the sixteen real-world tasks in the test split of CLUES, we focus on six tasks for evaluation: the remaining tasks have too few test samples (< 10) to be suitable for test-time adaptation. Figure 1 presents an illustrative example showcasing the nature of these tasks and provides examples of the corresponding natural language explanations. Appendix §B provides further details regarding the six tasks selected for evaluation.
We use the ExEnt model (R. Menon et al., 2022) as the base language-guided classifier (M_LC), in alignment with our choice of the CLUES dataset. The ExEnt model leverages textual entailment to establish the correspondence between explanations and tabular inputs, enabling label predictions. To aggregate predictions from multiple explanations, ExEnt adopts a mean-pooling strategy over the predictions obtained from each explanation-input pair to derive the final label. It is important to note that ExEnt is not trained for abstention, meaning it always assigns a label regardless of the quality of the explanations. In §4.4, we further explore the scenario of abstention, which we consider a more realistic use case for language-guided classifiers.

Baseline and Evaluation Metrics
We compare TALC against the following baselines:
1. ExEnt: the base ExEnt model (R. Menon et al., 2022) trained on the real-world training tasks from the CLUES dataset.
2. ExEnt-MV: for each example X_i, we generate a set of pseudo-labels corresponding to each of the m task explanations. The final predicted label is the label that appears most frequently among the m pseudo-labels (majority vote). Unlike ExEnt, which uses mean-pooling for aggregation, ExEnt-MV applies a mode-pooling operation.
3. ExEnt-FT: analogous to fine-tuning with the predicted labels from the label aggregator L_agg in TALC, we include a self-training baseline that fine-tunes ExEnt using its own predictions as labels on the adaptation set.
We use classification accuracy as the evaluation metric to compare the utility of different methods.
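The ExEnt-MV aggregation is simply mode-pooling over each row of the labeling matrix; a minimal sketch:

```python
from collections import Counter

def majority_vote(row):
    """ExEnt-MV: pick the most frequent pseudo-label for one example.
    Ties break toward the label seen first in the row (an arbitrary
    choice; the paper does not specify tie-breaking)."""
    return Counter(row).most_common(1)[0][0]
```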

Results
Table 1 shows the zero-shot classification accuracy of TALC and the baselines on the six evaluation tasks. The findings reveal several key insights. Firstly, we observe that majority voting (ExEnt-MV) performs better than vanilla ExEnt on average across the six tasks. Secondly, fine-tuning ExEnt on its own predictions (ExEnt-FT) results in better zero-shot accuracies than the base ExEnt model, demonstrating the value of self-training on unlabeled data. Furthermore, the performance of ExEnt-FT increases with the amount of test data used for adaptation (35.7 → 36.8 as we increase the adaptation ratio from 0.5 → 1.0).
We note that TALC obtains better performance on average across all evaluation tasks compared to the three baselines. Specifically, TALC improves accuracy by around 3.3% on average (absolute) over the state-of-the-art ExEnt model. In fact, both TALC variants, at adaptation ratios 0.5 and 1.0, perform better than ExEnt on all tasks except indian-liver-patient. The label aggregator in TALC yields the biggest improvement (∼25% relative) on the tic-tac-toe-endgame task. We attribute this improvement to the label aggregator's ability to give higher weight to high-quality explanations, resulting in more accurate predictions. For the tasks where the performance of TALC is close at α = 0.5 and α = 1.0 in Table 1, we observed that the aggregation weights learned for each explanation by the data programming framework are roughly similar in the two settings. As a result, aggregation over the pseudo-labels in the labeling matrix produces similar final predictions and hence similar accuracies.

Analysis
Abstention. Our previous experimental results treat labels from individual explanations identically, irrespective of the model's confidence in its predictions on those examples. This is because the base language-guided classifier used in our experiments, ExEnt, always chooses a label during inference rather than making selective predictions. However, the TALC framework allows for differential modeling of abstentions, where a model can refrain from assigning a class label if the explanation does not apply to the example. To explore this, we design a variant of ExEnt, referred to as ExEnt-A, that can refrain from assigning a class label during inference. This is straightforward since ExEnt is based on NLI (R. Menon et al., 2022), where a neutral label can naturally be mapped to an abstention. We train ExEnt-A on the same tasks as ExEnt, with the modification of having 'abstain' as an additional class label for each task.
Table 2 shows the results of TALC-A and the baselines when abstention is allowed ('-A' denotes abstention). We find that TALC-A achieves the best overall accuracy. More importantly, comparing Table 1 and Table 2, we observe that TALC has a smaller drop in performance than ExEnt and ExEnt-FT, suggesting that TALC is better at adapting to multiple teachers even when certain teachers choose to abstain from prediction.

Effect of adaptation set size. We analyze the performance of ExEnt-FT and TALC by varying the size of the adaptation set. Specifically, we vary the adaptation ratio, α, from 0.2 to 1.0 (in increments of 0.1) for all six evaluation tasks. Intuitively, we expect the accuracy of ExEnt-FT and TALC to improve as the adaptation ratio increases. However, we empirically observe that the performance of ExEnt-FT fluctuates with changes in α and does not show a consistent trend of improvement, as shown in Figure 2. Meanwhile, as also shown in Figure 2, a larger adaptation set enhances the performance of TALC from 37.6% → 38.8% as α increases from 0.2 → 1.0.
Robustness to number of explanations. Next, we analyze the robustness of ExEnt-FT and TALC to changes in the number of explanations provided for adaptation on the new task. We refer to the fraction of explanations used for adaptation as the explanation ratio = (# of available explanations) / (# of all explanations). Specifically, we vary the explanation ratio from 0.2 to 1.0, randomly choosing explanations without replacement, when training the ExEnt-FT model and the label aggregator in TALC. We keep the adaptation ratio (α) fixed at 1.0 for this analysis.
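Sampling a fraction of the explanations without replacement, as in this analysis, can be sketched as follows (the function name and the fixed seed are our own illustrative choices):

```python
import random

def sample_explanations(explanations, ratio, seed=0):
    """Randomly keep a `ratio` fraction of the explanations, without
    replacement, for training the label aggregator."""
    k = max(1, round(len(explanations) * ratio))
    return random.Random(seed).sample(explanations, k)
```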
Figure 3 shows the variation in performance of ExEnt-FT and TALC with changes in the explanation ratio, averaged over the six evaluation tasks. The accuracy of TALC drops when increasing the explanation ratio from 0.3 → 0.5, but shows a consistent increasing trend (from 33.5% → 38.8%) as the explanation ratio increases from 0.5 → 1.0. In contrast, the performance of ExEnt-FT fluctuates as the number of available explanations changes. This shows that TALC is comparatively more sensitive to the number of explanations used for adaptation.

Robustness to quality of explanations.
Here we analyze the role of explanation quality in the performance of TALC. However, quantifying the quality of explanations in the absence of annotations is a challenging and open research problem. To circumvent this issue, we explore two approaches to quantify explanation quality:
• Individual explanation accuracy: we assume there exists an oracle with access to all the explanations, the base language-guided classifier, and the labeled examples. This oracle evaluates the accuracy of each individual explanation by evaluating it on the labeled examples with the base language-guided classifier. We term this the individual explanation accuracy and use it as a proxy for the quality of an explanation. For each of the six evaluation tasks, we provide the individual explanation accuracies in Appendix §F.
• Perplexity of an explanation: assuming access to all labeled examples (needed for the above approach) may be unrealistic in many scenarios. Hence, we also explore a surface-level metric, the perplexity of the explanation, to quantify its quality. We obtain perplexity scores for each explanation using the GPT2-Large pre-trained model (Radford et al., 2019). We provide perplexity scores of each explanation for the six evaluation tasks in Appendix §F.
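Concretely, perplexity is the exponentiated average negative log-likelihood of the explanation's tokens. The sketch below abstracts away the GPT2-Large scoring step and takes per-token NLLs (in nats) as input:

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (nats).
    In the paper these would come from GPT2-Large scoring an
    explanation; the scoring model is abstracted away here."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A lower perplexity means the language model finds the explanation more fluent, which is the basis for the quality ranking used below.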
These approaches to quantifying the quality of an explanation can filter out poor-quality explanations or selectively choose good-quality explanations for adapting the base language-guided classifier. We explore the following scenarios (with adaptation ratio α = 1) to understand the impact of explanation quality:
• Using the top X percent of explanations: we rank the explanations by accuracy or perplexity for each task and only use the top X percent of the ranked explanations for TALC, where X = 20, 40, 60, 80, 100. The results are shown in Figure 4. On average, TALC performs best when using only the top 20% of explanations, whether ranked by accuracy or by perplexity. As X increases from 20 → 40 → 60, the average performance of TALC first decreases and then keeps increasing. We attribute this trend to the fact that training the label aggregator may be sub-optimal with a smaller number of explanations and improves with more explanations. These results also show that the label aggregator is able to distinguish explanation quality. We note a roughly similar trend when the explanations are ranked by lowest perplexity instead of highest accuracy. This is an encouraging result, indicating that the perplexity of explanations can be a reasonable basis for filtering from a large pool of explanations.
• Removing the best explanation: we remove the best (highest accuracy or lowest perplexity) explanation from the set of explanations for each task and adapt TALC.
Figure 5 shows that removing the best explanation hurts performance consistently across tasks, as expected. We observe a 1.3% drop in accuracy when ranking the explanations by accuracy and a 1.0% drop when ranking by perplexity, on average across the six tasks (shown in Appendix §I).

Figure 6: Comparison of TALC's performance before and after adding a low-quality explanation to a set of high-quality explanations. On average, the performance decreases by 1.5% when ranking by accuracy.
• Adding a low-quality explanation to a set of high-quality explanations: Next, we study the impact of low-quality explanations on TALC.
For this, we consider two setups. In the first setup, TALC uses just the top-3 explanations as per their individual accuracies. The individual explanation accuracies can be found in Appendix §F. In the second setup, TALC uses the top-3 and the worst explanation (as per individual explanation accuracy) for adaptation. Figure 6 shows the performance of TALC in these settings. When ranking by accuracy, the average decrease in performance due to the addition of a low-quality explanation is 1.5%, demonstrating the robustness of TALC to low-quality explanations. We observe a similar trend in results when the explanations are ranked by their perplexity (details in Appendix §I).
• Replacing the best explanations with malicious explanations: next, we create malicious explanations by flipping the labels mentioned in the original explanations. For example, taking the explanation from Figure 1 for the travel-insurance task, we convert 'most college graduates have taken travel insurance' to 'most college graduates have not taken travel insurance'. We repeat this process for the top-3 explanations ranked by accuracy or perplexity for each of the six evaluation tasks. The results in Figure 7 show a drop in performance of TALC (from 38.8% to 31.5%), as expected, when the top-3 explanations (ranked by their individual accuracies) are modified into malicious explanations. When explanations are ranked by perplexity, the results are similar (details in Appendix §I). Surprisingly, for the 'car-evaluation' task, performance increased from 16.5% to 21.4% when modifying the best explanations into malicious ones under accuracy-based ranking. From the average drop in performance, we conclude that TALC is susceptible to text-based attacks that may occur through the explanations provided during adaptation. Future work can address the challenge of learning to distinguish between beneficial and adversarial explanations.
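The quality-based selection used in the ablations above can be sketched as follows (the helper and parameter names are ours):

```python
def top_x_percent(explanations, scores, x, lower_is_better=False):
    """Keep the top x% of explanations ranked by a quality score:
    individual explanation accuracy (higher is better) or
    perplexity (pass lower_is_better=True)."""
    ranked = sorted(zip(explanations, scores), key=lambda t: t[1],
                    reverse=not lower_is_better)
    k = max(1, round(len(ranked) * x / 100))
    return [e for e, _ in ranked[:k]]
```

The same helper covers both ranking criteria: dropping the first element of the accuracy-ranked list gives the "removing the best explanation" ablation, and appending the last element gives the "adding a low-quality explanation" one.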
Agnostic nature of TALC w.r.t. the language-guided classifier. The flexibility of choosing different models as the underlying language-guided classifier is an advantage of the TALC framework. The modular design of TALC, i.e., decoupling (1) how we obtain predictions w.r.t. each explanation using a language-guided classifier from (2) how we combine these individual predictions, makes TALC a highly generalizable and flexible framework. To empirically validate this flexibility, we experiment with different LLMs as the underlying language-guided classifier. In this setting, the prediction is done by prompting the LLM.

Table 3: Comparison of accuracies between TALC and baselines when using different LLMs as the language-guided classifier on the 6 different tasks from CLUES-Real. We report the mean and standard deviation of accuracy across three runs for adaptation-based methods. The numbers in bold indicate the best accuracy across methods.
We provide the prompt templates in Appendix §D.
Table 3 shows that TALC outperforms both baselines for all three LLMs, demonstrating the robustness of TALC to the choice of the underlying language-guided classifier.

Discussion & Conclusion
In this paper, we introduce TALC, a framework for test-time adaptation of language-guided classifiers that leverages multiple sources of supervision. One conceptual advantage of TALC is its agnosticism towards the choice of language-guided classifier, leaving room for future exploration with different models. TALC is flexible in terms of which aspects of explanations, teachers, and unlabeled examples are used to train the label aggregator. While our approach trains a label aggregator for every new task (since our features for the label aggregator include the identities of individual explanations), in principle it should be possible to train a unified label aggregator across tasks based on featurized representations of tasks and explanations. Scaling up TALC to new datasets with a larger number of tasks would provide valuable insights into its generalizability. Our experiments reveal TALC's susceptibility to malicious attacks and bad-faith actors, which future work can improve on. Despite these challenges, TALC suggests exciting opportunities for harnessing the collective wisdom of teachers in real-world applications.

Limitations
To analyse the impact of the quality of an explanation during test-time adaptation, we use individual explanation accuracy as a surrogate measure for quality, in lieu of a standardized metric of explanation quality. However, developing standardized metrics to judge the quality of an explanation remains an open and pressing research challenge.
To analyse the robustness of TALC w.r.t. malicious explanations, we created malicious explanations by flipping the labels mentioned in the best explanation for a task. However, there could be other, more subtle ways of creating malicious or adversarial explanations than flipping a label. For example, one subtle way of altering an existing explanation into a malicious one could be to establish unwanted correlations between a protected attribute (e.g., gender) and the label for a downstream task (e.g., whether a loan should be approved). Analyzing and improving the robustness of TALC to more nuanced adversarial/malicious explanations remains to be explored.
The adapted model obtained by using TALC is task dependent, as it uses explanations and unlabeled data specific to the downstream task for adaptation (specifically, for training the label aggregator component). Hence, for every novel task for which we want to adapt a base language-guided classifier, we need access to explanations and unlabeled samples. This requirement (especially obtaining good explanations for adaptation) can be a challenging issue in some real-world scenarios. Improving TALC to reduce its dependence on the amount of explanations and/or unlabeled data while still retaining downstream accuracy (post-adaptation) is an interesting direction for future work. The base language-guided classifier used in our experiments, ExEnt, is designed to work with a maximum of 512 tokens in its context. The use of longer-context models or large-scale pre-trained models remains to be explored, as does the effectiveness of TALC in multilingual settings.

Ethics and Broader Impact
The experiments described in this work are performed on tasks from a publicly available benchmark, CLUES (R. Menon et al., 2022). The data for these tasks do not contain any personally identifiable information. We do not collect or annotate any additional data. For all the experiments in the paper we evaluate using automatic metrics and do not perform any human evaluation.
TALC is agnostic to the base language-guided classifier. We do not foresee major risks with our framework if the inputs provided are appropriate. As with any other natural-language-guided method, there are potential concerns about deliberately misguiding a model by providing erroneous inputs. Measures to detect such bad actors and rectify erroneous inputs are beyond the scope of this work. However, there is a risk of classifiers perpetuating biases present in the input natural language explanations (for example, some explanations may describe the label in terms of sensitive or inappropriate features). Biased or discriminatory explanations can result in biased predictions and contribute to unjust outcomes.
The broader impact of this work can lead to the development of frameworks that enable efficient adaptation of AI systems. Developing language-guided adaptable systems can improve the impact and usability of AI systems in daily life, especially on the long tail of tasks with limited labeled data. However, the responsible development and deployment of these models requires domain-specific expertise, involving collaboration with experts and stakeholders to understand the implications and ensure ethical considerations are met. Close attention should be paid to the specific contexts in which the classifiers are applied, to minimize negative consequences and maximize positive impacts.

Appendix
A Usage of TALC
1. An example of a real-world case where the entire set of test samples can realistically be accessed: Consider a product-category classifier for products in the Amazon database. Here, developers first define classifiers using some training data and then deploy the classifier on the entire database to label examples.
2. Method to use TALC when test samples are observed one by one: Even if we do not have access to the entire test set and the classifier observes unlabeled samples one by one, TALC can be deployed in practice as follows: (a) For a predetermined number of samples, the language-guided classifier is deployed off-the-shelf (in this work, this is the same as using ExEnt for those samples). (b) These samples are then pooled together as an adaptation set, and the language-guided classifier is adapted using TALC.
In other words, we incur a "warm-up" phase during which the un-adapted classifier is used, after which we adapt the classifier using TALC, treating the set of samples observed during warm-up as the adaptation set.
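The warm-up deployment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classifier` and `adapt` are hypothetical stand-ins for the base language-guided classifier (e.g. ExEnt) and the TALC adaptation step.

```python
from typing import Any, Callable, Iterable, Iterator, List


def deploy_with_warmup(
    classifier: Callable[[Any], int],
    adapt: Callable[[Callable[[Any], int], List[Any]], Callable[[Any], int]],
    stream: Iterable[Any],
    warmup_size: int = 100,
) -> Iterator[int]:
    """Yield predictions for a stream of unlabeled samples.

    During the warm-up phase the off-the-shelf classifier is used; once
    `warmup_size` samples have been observed, they are pooled into an
    adaptation set and the classifier is adapted (e.g. with TALC) before
    serving any further predictions.
    """
    pool: List[Any] = []
    for x in stream:
        if len(pool) < warmup_size:
            pool.append(x)
            yield classifier(x)            # warm-up: un-adapted predictions
            if len(pool) == warmup_size:   # pool complete: adapt once
                classifier = adapt(classifier, pool)
        else:
            yield classifier(x)            # post-adaptation predictions
```

For instance, with `warmup_size=3`, the first three predictions come from the un-adapted classifier and all later ones from the adapted classifier.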

B Details of evaluation tasks
We use 6 real-world classification tasks from R. Menon et al. (2022) as our evaluation tasks. The tasks considered are uci/banknote-authentication, uci/tic-tac-toe-endgame, uci/car-evaluation, uci/contraceptive-method-choice, uci/indian-liver-patient, and kaggle/travel-insurance. Examples of these tasks can be found on the CLUES website: https://clues-benchmark.github.io. Among these, uci/car-evaluation and uci/contraceptive-method-choice are multi-class classification tasks, while the remaining tasks are binary classification tasks. The test sets contain 275, 195, 346, 295, 115, and 398 examples for uci/banknote-authentication, uci/tic-tac-toe-endgame, uci/car-evaluation, uci/contraceptive-method-choice, uci/indian-liver-patient, and kaggle/travel-insurance, respectively.

C Hyperparameter and Compute Details
We train the ExEnt model following the hyperparameters in R. Menon et al. (2022): a learning rate of 1e-5 for 5 epochs, a training batch size of 2, and an evaluation batch size of 16. For training the label aggregator, we performed a hyperparameter search for each task and report the best hyperparameters in Table 5.
For fine-tuning the ExEnt model, compute time ranged from 1 hour for the shortest jobs (with smaller data sizes) to 2 hours on a single RTX 2080Ti GPU. Fine-tuning the label aggregator takes under 1 minute.

D Prompt Templates for LLM Experiments
For the experiments using large language models, we used the prompt templates shown in Table 4.

E Learned Label Aggregator Explanation Weight
We analyze the learned weights of the label aggregator, L_agg, to interpret the contribution of each explanation to TALC's final prediction at an adaptation ratio of 1.0. First, we compute the accuracy of each individual explanation for a task by using it alone with ExEnt to classify the entire test set. These individual explanation accuracies serve as a proxy for relative explanation quality. The learned explanation weights of the label aggregators for the 6 datasets, along with the learned weight trends, are visualized in Figure 9.
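To illustrate how per-explanation weights can drive a final prediction, the following sketch implements a simple weighted vote over pseudo-labels. This is a deliberate simplification of the learned label aggregator, not its actual architecture; the function name and the weight semantics are assumptions for illustration.

```python
import numpy as np


def aggregate_pseudo_labels(pseudo_labels: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-explanation pseudo-labels into final predictions.

    pseudo_labels: (n_examples, n_explanations) int array of predicted classes,
                   one column per explanation.
    weights: (n_explanations,) per-explanation weights (e.g. learned quality scores).
    Returns: (n_examples,) aggregated labels via a weighted vote.
    """
    n_examples = pseudo_labels.shape[0]
    n_classes = int(pseudo_labels.max()) + 1
    votes = np.zeros((n_examples, n_classes))
    # Each explanation casts a vote for its predicted class, scaled by its weight.
    for j, w in enumerate(weights):
        votes[np.arange(n_examples), pseudo_labels[:, j]] += w
    return votes.argmax(axis=1)
```

Under this scheme, two low-weight explanations that agree can still outvote one high-weight explanation, which is the intuition behind weighing noisy teachers rather than trusting any single one.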
We also report the average learned explanation weight for explanations with/without quantifiers, and for explanations with/without conjunctions, in Table 6.

F Individual Explanation for Each Task
We show all available natural language explanations for the six CLUES-Real datasets used in this paper in Tables 8 to 13. In these tables, we also report the accuracy when using each explanation alone with ExEnt, along with the perplexity of each explanation. In Figure 10, we analyze the correlation between the accuracy and perplexity of all explanations, and find a positive correlation between the two.
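The reported correlation can be computed as a Pearson coefficient over per-explanation statistics. A minimal sketch follows; the accuracy and perplexity values in the usage note are placeholders, not the paper's actual numbers.

```python
import numpy as np


def accuracy_perplexity_correlation(accuracies, perplexities) -> float:
    """Pearson correlation between per-explanation accuracies and perplexities.

    accuracies: sequence of per-explanation accuracies on the test set.
    perplexities: sequence of per-explanation perplexities (same order).
    """
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson correlation between the two sequences.
    return float(np.corrcoef(accuracies, perplexities)[0, 1])
```

For example, `accuracy_perplexity_correlation([0.5, 0.6, 0.7], [10.0, 20.0, 30.0])` returns 1.0, since the placeholder sequences are perfectly linearly related.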

G Results for Models Without Abstention
Here, we show ExEnt-FT and TALC results without abstention at different adaptation set sizes for each of the 6 tasks from CLUES-Real in Figure 11.

H Few Shot Learning
Here, we run a few-shot supervised version of ExEnt. We fine-tune the ExEnt model using k samples with gold labels from the evaluation tasks, where k = 4, 8, 16, 32, and report the results in Table 7. We observe that the test accuracy of this few-shot-trained ExEnt is better than that of TALC. The few-shot model performs better because the gold labels are quite different from the noisy aggregated labels used by TALC for adaptation. We observed a large label imbalance in the predictions of the intermediate ExEnt model, which lowers accuracy for both TALC and ExEnt-FT, both of which leverage ExEnt's predictions as noisy pseudo-labels.

Table 7: Few-shot fine-tuning with ExEnt. We report the mean and standard deviation of accuracy across three runs using different seeds.

Figure 2: Accuracy (averaged over 6 tasks) of ExEnt-FT and TALC when training the label aggregator with different adaptation set sizes. Overall, increasing the adaptation ratio does not impact the performance of ExEnt-FT, but improves the performance of TALC.

Figure 3: Results for ExEnt-FT and TALC when varying the number of explanations used for training the label aggregator. Results are averaged over the six evaluation tasks. As the number of explanations increases, TALC's accuracy improves, while the performance of ExEnt-FT is unaffected.

Figure 4: TALC's performance when using only the top X% of explanations, where X = 20, 40, 60, 80, 100. On average, TALC performs best when using only the highest-quality explanations. TALC's performance first decreases and then increases as lower-quality explanations are added: the low-quality explanations initially distract the label aggregator, but the aggregator learns to distinguish high-quality explanations as the number of explanations keeps increasing.
Figure 5:

Figure 7: Comparison of TALC's performance before and after replacing good-quality explanations with malicious explanations. On average, TALC's performance drops by 7.3% when good explanations are replaced by malicious ones.

Figure 12: When ranking explanations by their individual perplexity, removing the best explanation leads to a 1.0% drop in performance on average.

Figure 13: Comparison of TALC's performance before and after adding a low-quality explanation to a set of high-quality explanations. On average, performance increases by 0.6% when ranking by perplexity.
Figure 14:

Table 1: Comparison of zero-shot accuracies (higher is better) between non-adaptation ExEnt baselines, ExEnt-FT, and our proposed method, TALC, on the 6 different tasks from CLUES-Real. We report the mean and standard deviation of accuracy across three runs for adaptation-based methods. Numbers in bold indicate the best accuracies across methods.
Table 2: Comparison of zero-shot accuracies between TALC and the baselines when allowing the ExEnt model to abstain from making a prediction (the modified model is denoted ExEnt-A). 'A' stands for 'Abstention' for all models in the table. For the adaptation methods (ExEnt-FT, TALC), we report the mean and standard deviation across 9 adaptation ratios (0.2 to 1.0). Numbers in bold denote the best accuracies across methods.

Table 4: Prompt templates used for the large language model experiments in Section 4.4.

Table 5: The best hyperparameters for training the label aggregator.

If the variance of the note is a negative number, it's more likely to be an original note.

Table 8: All explanations for the banknote-authentication task used in this paper.

Table 9: All explanations for the tic-tac-toe-endgame task used in this paper.

Table 10: All explanations for the car-evaluation task used in this paper.

If the wife's education is not high, then the contraceptive method used is no-use or short-term.

Table 11: All explanations for the contraceptive-method-choice task used in this paper.

Table 12: All explanations for the indian-liver-patient task used in this paper.

Frequent flyers with an annual income over 1 million usually take travel insurance. People with an annual income below 1,000,000 are less likely to have traveled abroad than those with annual incomes above 1,000,000.

Table 13: All explanations for the travel-insurance task used in this paper.