MEAL: Stable and Active Learning for Few-Shot Prompting

Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (data selection) and across different finetuning runs (run variability). This is problematic not only because it impedes the fair comparison of different approaches, but especially because it makes few-shot learning too unreliable for many real-world applications. To alleviate these issues, we make two contributions for more stable and effective few-shot learning: First, we propose novel ensembling methods and show that they substantially reduce run variability. Second, we introduce a new active learning (AL) criterion for data selection and present the first AL-based approach specifically tailored towards prompt-based learning. In our experiments, we show that our combined method, MEAL (Multiprompt finetuning and prediction Ensembling with Active Learning), improves overall performance of prompt-based finetuning by 2.3 points on five diverse tasks. We publicly share our code and data splits at https://github.com/akoksal/MEAL.


Introduction
Pretrained language models (PLMs) are effective few-shot learners when conditioned with a few examples in the input (Brown et al., 2020; Min et al., 2022, i.a.) or finetuned with a masked language modeling objective on samples converted into cloze-style phrases (Schick and Schütze, 2021a; Gao et al., 2021). Prompt-based finetuning is especially promising as it enables researchers to train relatively small models as few-shot classifiers that can make accurate predictions with a minimal investment of time and effort.
However, prompt-based finetuning suffers from high variance. We observe two causes in our experiments: run variability (different seeds) and data selection (different training sets). Figure 1 illustrates this for five equal-size training sets and 20 runs for RTE (Dagan et al., 2006) and MRPC (Dolan and Brockett, 2005). Both sources of variance are of particular concern in few-shot learning. We may get lucky and select a "good" training set. But because no dev set is available, there is also a high risk of selecting a "bad" training set, resulting in much lower performance than possible for the available annotation budget. In addition, run variability is a great methodological problem because it means that the exact same experimental setup (except for different random seeds, causing variance in the order of training examples and dropout layers) will give different results. This makes fair comparison of different algorithms and architectures difficult.
We propose new approaches to few-shot learning that address both sources of variance. We first focus on run variability and show, based on loss/accuracy surface visualizations (Li et al., 2018), that run variability in few-shot learning is different from fully supervised settings: solutions proposed for finetuning PLMs (Mosbach et al., 2021) do not work for few-shot prompt-based finetuning. Thus, we propose ensemble techniques to stabilize finetuning across different runs. After mitigating the effects of run variability via a more stable finetuning mechanism, we are able to address training data selection. We modify existing active learning (AL) algorithms and propose a novel approach for selecting training examples that outperforms prior algorithms, not just in terms of final accuracy, but also regarding the diversity and representativeness of selected examples. To the best of our knowledge, we are the first to develop AL algorithms tailored to prompt-based finetuning.
We combine our contributions (reduced run variance and better training sets for improved performance and stability of few-shot classification) in MEAL (Multiprompt finetuning and prediction Ensembling with Active Learning). MEAL improves performance of prompt-based finetuning by 2.3 points on five tasks. Contributions: 1. We propose a training procedure that produces a single few-shot classification model with multiple prompts on top of PET (Schick and Schütze, 2021a). This reduces model space complexity and improves overall performance.
2. We show that run variability is a big problem in few-shot classification and conduct an exhaustive analysis of why existing solutions do not apply to few-shot prompt-based finetuning. We propose ensemble techniques to improve run stability.
3. We propose a novel AL method for data selection that outperforms prior AL work and random selection. Our work is the first to demonstrate that AL is beneficial in prompt-based learning.

Related Work
Few-shot classification with language model prompting. GPT-3 (Brown et al., 2020) prepends examples as conditioning to the input during inference, without parameter updates. PET (Schick and Schütze, 2021a,b) follows a similar approach with finetuning and achieves comparable results with fewer parameters. LM-BFF (Gao et al., 2021) and ADAPET (Tam et al., 2021) build on this line of work with automatic prompt generation and modified training objectives, respectively.

Multiprompt Finetuning
Let M be a masked PLM, T its vocabulary, and MASK ∈ T the mask token. We use Pattern-Exploiting Training (PET) (Schick and Schütze, 2021a) for prompt-based finetuning experiments on few-shot classification, without knowledge distillation and unlabeled data. Patterns (P) transform an input x into a cloze-style phrase x_p with a single mask. Verbalizers (V) convert each label l ∈ L into a single token s_l ∈ T, representing the task-specific meaning of the output label.
Our prediction for a label is its probability, according to the PLM, as a substitution for the mask:

P(y = l | x) = exp(s_M(s_l | x_p)) / Σ_{l′ ∈ L} exp(s_M(s_{l′} | x_p))

where s_M gives the raw score of V(y) from a PLM M for the MASK position in the cloze-style phrase of the input. Using the cross-entropy loss of P, PET trains a separate model for each prompt (i.e., single prompt finetuning). In inference, it ensembles model predictions by logit averaging. We propose multiprompt finetuning, a modified PET that trains a single model M on all prompts for a task simultaneously. During inference, we also use ensembling with logit averaging across prompts. However, our approach generates a single finetuned model regardless of the number of prompts. Compared to PET, this reduces runtime, memory, and overall complexity.
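The scoring rule above can be sketched in a few lines. This is a minimal numpy illustration with toy mask-position scores; `label_probs` and the example verbalizer are our own illustrative names, not from the paper's released code:

```python
import numpy as np

def label_probs(mask_scores, verbalizer):
    """PET-style prediction: softmax over the verbalizer tokens' raw
    scores at the MASK position.

    mask_scores: dict mapping vocabulary token -> raw score s_M(t | x_p)
    verbalizer:  dict mapping label -> its single verbalizer token s_l
    """
    labels = list(verbalizer)
    scores = np.array([mask_scores[verbalizer[l]] for l in labels])
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return dict(zip(labels, exp / exp.sum()))

# Toy raw scores for the MASK position of a sentiment cloze such as
# "x. It was [MASK]." with verbalizer {pos: "great", neg: "terrible"}.
scores = {"great": 2.0, "terrible": 0.5, "the": 5.0}  # non-verbalizer tokens ignored
probs = label_probs(scores, {"pos": "great", "neg": "terrible"})
```

Note that the softmax is taken only over the verbalizer tokens, so high-scoring non-verbalizer tokens (like "the" here) do not affect the label distribution.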

Run Variability
In few-shot classification, finetuning PLMs such as ALBERT (Lan et al., 2020) with an MLM objective on samples converted into cloze-style phrases (Schick and Schütze, 2021b) performs comparably to the much larger GPT-3 (Brown et al., 2020). Just as prompting methods are sensitive to data order (Lu et al., 2022) and label distributions (Zhao et al., 2021b), finetuning PLMs also exhibits sensitivity and instability, as shown by Dodge et al. (2020b) for a fully supervised setting.
We show that the instability of finetuning PLMs also exists in few-shot prompt-based finetuning. Even though prompt-based finetuning does not introduce new parameters like classifier heads as in fully supervised classification, there is variance from dropout and training data order. We conduct experiments with multiprompt finetuning with default PET (Schick and Schütze, 2021a) settings without knowledge distillation. Figure 1 shows that runs with different random seeds for the same training set can vary by as much as 23.5 points.

Mosbach et al. (2021) suggest that longer training with a low learning rate and warm-up reduces run variability of PLMs. Their main motivation is to avoid models ending up in suboptimal training loss regions. However, this is not valid in few-shot prompt tuning, as the number of training examples is low and finetuning achieves almost zero training loss quickly. Our initial experiments show that longer training does reduce the standard deviation between different runs, but that it also causes lower mean accuracy for most tasks, of up to 7.3 points.
In Figure 2, we analyze run variability by creating a training loss and validation accuracy surface visualization of two RTE runs with the same training set and multiprompt finetuning. The failed model θ_f (red) achieves 58.5% validation accuracy while the successful model θ_s (green) achieves 71.5%. The two models only differ in finetuning random seed. The figure illustrates the training loss and validation accuracy surfaces for combinations of the model weights of the pretrained model (θ_p), the failed model (θ_f), and the successful model (θ_s). We create a two-dimensional space based on

θ(a, b) = θ_p + a · (θ_f − θ_p) + b · (θ_s − θ_p)

and evaluate F(θ(a, b)), where F is loss (left) or accuracy (right). We use 16 values for a and b to plot loss and accuracy surface forms.
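The 2-D interpolation of model weights can be sketched as follows. This is a hypothetical helper operating on state dicts of numpy arrays (the grid range is our assumption; the actual visualization evaluates the full model's loss or accuracy at each grid point):

```python
import numpy as np

def interpolate(theta_p, theta_f, theta_s, a, b):
    """theta(a, b) = theta_p + a*(theta_f - theta_p) + b*(theta_s - theta_p).
    Each argument is a dict mapping parameter name -> numpy array / scalar."""
    return {k: theta_p[k] + a * (theta_f[k] - theta_p[k])
               + b * (theta_s[k] - theta_p[k])
            for k in theta_p}

# 16 x 16 grid of (a, b) values; F (loss or accuracy) is evaluated at each point
a_vals = np.linspace(-0.5, 1.5, 16)
b_vals = np.linspace(-0.5, 1.5, 16)

# toy one-parameter models: (a, b) = (1, 0) recovers theta_f, (0, 1) recovers theta_s
theta_p, theta_f, theta_s = {"w": 0.0}, {"w": 1.0}, {"w": 2.0}
mid = interpolate(theta_p, theta_f, theta_s, 0.5, 0.5)
```

The three anchor models span a plane in parameter space; plotting F over that plane yields the surfaces shown in Figure 2.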
Figure 2 shows that there is a large region with ≤1e-4 training loss (left graph, dark blue) that includes θ_f and θ_s. However, most of this region is suboptimal in terms of validation accuracy (right graph). This indicates that our instability problem differs from fully supervised finetuning, where large learning rates often result in suboptimal training loss; in contrast, we observe ≈0 training loss for each run, including failed ones. Therefore, longer training with a low learning rate and warm-up only leads to finetuned models ending up in a similar region with lower variance, but it causes suboptimal validation accuracy scores; see §6 for more details.
To overcome run variability, we propose two ensemble models: we ensemble the logits over runs in ENSEMBLE_pred and take the average of parameters over runs in ENSEMBLE_para. We will show that, for five tasks, these (i) reduce the effect of failed runs and run variability and (ii) achieve higher accuracy than accuracy averaged over runs.
The prediction of ENSEMBLE_pred for x is:

s( (1 / (R · |P|)) Σ_{r=1}^{R} Σ_{p∈P} F_r(x, p) )

where s is softmax, R is the number of runs, P is the set of prompts, and F_r gives, for the finetuned model in run r, the logit of each class for the input x with prompt p. Following work on averaging deep networks (Izmailov et al., 2018), we average each parameter of the finetuned PLMs across runs, resulting in a single model. The prediction of ENSEMBLE_para for x is the prediction of this single model.
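Both ensembles can be sketched in a few lines. This is a minimal numpy sketch; the function names are ours, and the toy `state_dicts` stand in for the finetuned models' parameters:

```python
import numpy as np

def ensemble_pred(logits):
    """ENSEMBLE_pred: average logits over runs and prompts, then softmax.
    logits: array of shape (R runs, |P| prompts, L classes) for one input x."""
    avg = logits.mean(axis=(0, 1))
    exp = np.exp(avg - avg.max())
    return exp / exp.sum()

def ensemble_para(state_dicts):
    """ENSEMBLE_para: average each parameter across the R finetuned models,
    yielding a single model (cf. Izmailov et al., 2018)."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

# toy example: R=2 runs, |P|=2 prompts, L=2 classes
logits = np.array([[[2.0, 0.0], [2.0, 0.0]],
                   [[0.0, 0.0], [4.0, 0.0]]])
probs = ensemble_pred(logits)
avg = ensemble_para([{"w": np.array([0.0, 2.0])},
                     {"w": np.array([2.0, 4.0])}])
```

ENSEMBLE_pred keeps all R models at inference time, whereas ENSEMBLE_para collapses them into one set of weights before prediction.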

Data Selection
Another important source of variance for few-shot classification is training data selection. Figure 1 shows this effect: validation accuracy varies greatly, with a difference of up to 13.7 points.
Figure 3 shows how we modify AL algorithms for data selection in few-shot prompt-based finetuning. First, we use a PLM to get contextual embeddings, logits, and probabilities for each unlabeled example in a zero-shot setting. We exploit here that, due to the cloze-style format, PLMs can make predictions before any finetuning. Second, we apply modified AL algorithms for prompts. We select all examples at once to simplify the selection process. For each task, we select 16L training examples, where L is the number of labels.

Prior-Work Active Learning
We use a range of prior-work AL algorithms, including random, uncertainty-only (e.g., entropy) and combined approaches (e.g., BADGE). Although these are prior work, adapting them to a prompt-based setup is non-trivial; e.g., for BADGE it requires concatenating gradient vectors across prompts. Therefore, this adaptation is one of the contributions of our paper. Importantly, none of the prior work leverages the prediction variety across different prompts. Entropy selects examples with the highest predictive entropy, summed across prompts:

ent(x_i) = − Σ_{p∈P} Σ_{j=1}^{L} P(y = l_j | x_{i,p}) log P(y = l_j | x_{i,p})

where L is the number of labels, P is the set of prompts, and x_{i,p} is input x_i with pattern p. Breaking Ties (BT) (Luo et al., 2004) selects examples with minimum difference between the highest two probability classes.

bt(x_i) = Σ_{p∈P} [P(y = l_1 | x_{i,p}) − P(y = l_2 | x_{i,p})]

where l_1 and l_2 are the labels with highest and second highest probability for x_{i,p}. Lowest Confidence (LC) (Culotta and McCallum, 2005) calculates lc as the sum of probability scores for the predicted class across prompts. We select examples with the lowest lc. lc and bt give the same order when there are two labels. Batch AL by Diverse Gradient Embeddings (BADGE) (Ash et al., 2020) uses as representation the gradient of the cross-entropy loss, conditioned on the one-hot encoding of the predicted label, with respect to the parameters of the final (output) layer. For prompt-based finetuning, we represent x_i as the concatenation of the gradient vectors across prompts by using the decoder of the masked PLM head as the final layer. We find 16L (i.e., the number of training examples) cluster centers using k-means++. These 16L cluster centers are then selected as the training set. We average BADGE over five seeds as k-means++ depends on initialization.
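The three uncertainty scores above can be sketched over prompt-wise probability matrices. This is a numpy sketch with our own function names; each row of `probs` is one prompt's class distribution:

```python
import numpy as np

def entropy_score(probs):
    """Sum of predictive entropies across prompts (select HIGHEST).
    probs: (|P|, L) array, one class distribution per prompt."""
    return float(-(probs * np.log(probs)).sum())

def breaking_ties(probs):
    """Sum over prompts of the gap between the two most probable classes
    (select LOWEST)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return float((top2[:, 1] - top2[:, 0]).sum())

def lowest_confidence(probs):
    """Sum over prompts of the predicted class's probability (select LOWEST)."""
    return float(probs.max(axis=1).sum())

uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])   # model is unsure under both prompts
confident = np.array([[0.9, 0.1], [0.8, 0.2]])
```

For two labels, `breaking_ties` and `lowest_confidence` rank examples identically, matching the note above.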

Prompt-Specific Active Learning
To make prior-work AL algorithms usable in prompt-based learning, we sum over different prompts in §5.1. However, these algorithms do not consider the varied predictions made by the PLM across different prompts. Therefore, we propose a new uncertainty-only algorithm, called Prompt-Pair-KL (PP-KL), specifically designed for prompt-based learning. We calculate pp-kl(x_i) as the sum of KL divergence scores across prompt pairs and then select examples with the highest pp-kl. This approach gives high scores to x_i with high variability in the model's predictions, indicating that such examples are "non-redundant" in that each prompt contributes different information.
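A minimal sketch of the PP-KL score follows. This is our reading of the criterion; since the text does not specify whether KL is summed over ordered or unordered prompt pairs, this sketch sums both directions for each pair:

```python
import numpy as np
from itertools import combinations

def pp_kl(probs):
    """Sum of KL divergences between the predictive distributions of
    every prompt pair for one example (select HIGHEST).
    probs: (|P|, L) array, one class distribution per prompt."""
    total = 0.0
    for i, j in combinations(range(len(probs)), 2):
        p, q = probs[i], probs[j]
        total += float((p * np.log(p / q)).sum())  # KL(p || q)
        total += float((q * np.log(q / p)).sum())  # KL(q || p)
    return total

agreeing = np.array([[0.7, 0.3], [0.7, 0.3]])     # prompts agree -> score 0
disagreeing = np.array([[0.9, 0.1], [0.1, 0.9]])  # prompts disagree -> high score
```

Examples on which the prompts disagree receive the highest scores, which is exactly the "non-redundancy" signal described above.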
Inter-prompt uncertainty sampling with diversity (IPUSD) is our novel AL algorithm that combines prompt-specific uncertainty (i.e., PP-KL) and diversity sampling. It first represents each example x as a vector of dimensionality |P|·|L|: the concatenation of the L logits for x for each of the patterns in P. We utilize logits here as they represent the model's probability distribution, certainty, and divergence across different prompts. We cluster these representations with k-means, k=8. We sample a training set, uniformly distributed over the 8 clusters. Then the uncertainty score of the training set is calculated as the sum of its Prompt-Pair-KL scores. We repeat this iteration loop 1000 times. Finally, we select the training set with the highest uncertainty score. We select based on 1000 iterations to ensure a balance between randomization and uncertainty. Our initial experiments suggest that choosing the most uncertain examples selects outliers, resulting in poor performance. As k-means and sampling depend on the random seed, we repeat IPUSD five times. See §A.4 for the pseudo-code of IPUSD.
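The loop described above can be sketched end to end. This is a simplified stand-in, not the authors' implementation: we use a tiny hand-rolled k-means instead of a library one and draw `budget // n_clusters` examples per cluster (see §A.4 for the paper's pseudo-code):

```python
import numpy as np

def ipusd(features, uncertainty, n_clusters=8, budget=16, iters=1000, seed=0):
    """features:    (N, |P|*L) concatenated per-prompt logits per example
    uncertainty: (N,) precomputed pp-kl score per example
    Returns indices of the cluster-uniform candidate training set with
    the highest summed uncertainty over `iters` random samples."""
    rng = np.random.default_rng(seed)
    # tiny k-means (k=8 in the paper) over the logit representations
    centers = features[rng.choice(len(features), n_clusters,
                                  replace=False)].astype(float)
    for _ in range(20):
        assign = ((features[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = features[assign == c].mean(axis=0)
    best_set, best_score = None, -np.inf
    per_cluster = max(1, budget // n_clusters)
    for _ in range(iters):
        picks = []
        for c in range(n_clusters):              # uniform over clusters
            members = np.flatnonzero(assign == c)
            if len(members):
                picks.extend(rng.choice(members,
                                        min(len(members), per_cluster),
                                        replace=False))
        score = uncertainty[np.array(picks)].sum()
        if score > best_score:                   # keep most uncertain candidate set
            best_set, best_score = np.array(picks), score
    return best_set

rng = np.random.default_rng(1)
feats = rng.normal(size=(40, 8))   # toy pool: 40 examples, |P|*L = 8
unc = rng.random(40)
selected = ipusd(feats, unc, n_clusters=4, budget=8, iters=50)
```

The cluster-uniform sampling enforces diversity, while ranking whole candidate sets by summed PP-KL (rather than picking the single most uncertain examples) is what avoids the outlier problem mentioned above.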

Experiments and Results
Setup. We use a diverse set of five classification tasks to compare single to multiprompt finetuning, analyze run variability, and evaluate AL algorithms: RTE (Dagan et al., 2006), SST-2, SST-5 (Socher et al., 2013), TREC (Li and Roth, 2002), and MRPC (Dolan and Brockett, 2005). We use four prompts for each, described in §A.3. We report results on the validation set, as we conducted all experiments without hyperparameter tuning, assuming a realistic few-shot scenario in which no dev set is available for tuning.¹

Mosbach et al. (2021)'s longer training reduces run standard deviation, but causes suboptimal accuracy results for SST-5, TREC, and MRPC in multiprompt finetuning (L6 vs L5), and for all datasets in single prompt finetuning (L2 vs L1). We conclude: a longer training approach is not advisable for practical scenarios.
ENSEMBLE_pred consistently reduces the standard deviation for each dataset, both in single prompt (L4 vs L1, 43%) and multiprompt finetuning (L8 vs L5, 51%). This reduction in standard deviation is accompanied by an increase in accuracy of up to 1.0 absolute points, contrary to longer training. On the other hand, ENSEMBLE_para consistently performs better than the default only in multiprompt (L7 vs L5), but speeds up the prediction process during inference with a single model while also reducing the standard deviation.
On top of that, both ENSEMBLE techniques avoid failed runs. For example, the default approach with multiprompt gets 87.4% average accuracy (not shown in the table) with one of the five random training sets in SST-2, while its worst run with the same training set has 77.6% accuracy. ENSEMBLE_pred and ENSEMBLE_para ensure better average accuracy (88.5% and 88.8%) without any suboptimal models (accuracy of worst trials: 87.8% and 88.4%) for the same training set. Thus, the default approach can result in suboptimal performance and is therefore not reliable for real-world applications without validation data. Overall, ENSEMBLE_pred achieves clearly better overall performance and a lower standard deviation, but with the additional cost of multiple models (i.e., five in our experiments) during inference. ENSEMBLE_para is an alternative approach to increase stability and performance while providing a single model with lower time complexity during inference.

Data Selection. Table 2 compares our AL algorithms with uncertainty- and diversity-based prior work. To provide more stable results and a fair comparison by reducing noise from different runs, we employ multiprompt finetuning with ENSEMBLE_pred for each AL algorithm in this section. Our results show that all uncertainty-only algorithms (entropy, lowest confidence, breaking ties, and Prompt-Pair-KL; L2-L5) perform worse than random selection (L1) on average over the five datasets. Our interpretation is that, considering that we are finetuning a PLM with few examples, finetuning with the highest-uncertainty examples does not generalize well. In contrast, Schröder et al. (2022) found that uncertainty-only AL consistently performs better than random selection for fully supervised settings in PLMs.
Prior AL work that combines uncertainty and diversity, CAL (L6) and BADGE (L7), performs better than uncertainty-only algorithms (L2-L5). Furthermore, BADGE outperforms random on three out of five tasks. However, BADGE has a higher standard deviation (3.3) than random (2.8). Finally, when averaged over the five tasks, our proposed algorithm IPUSD (L8) performs better than random (L1) and better than all prior AL work (L2-L7), with higher accuracy and lower standard deviation.

Analysis
We now perform an in-depth analysis of AL algorithms to understand their relative performance better and to understand failure cases like SST-5. We believe that these insights will lead to improved AL strategies in future work. Table 3 shows that our proposed AL algorithm, IPUSD, outperforms all prior AL work with higher average accuracy, better ranking, and lower standard deviation. It outperforms random, a strong baseline, by 1.3 points. Table 2 shows that IPUSD performs worse than random only on SST-5, even though there is no large difference between random and IPUSD for diversity, representativeness, and label entropy on SST-5 (not shown in the table). The problem is that the SST-5 classes negative/very negative and positive/very positive are not clearly differentiated. In a manual investigation, we found that IPUSD selects examples that are good candidates either for both negative/very negative or for both positive/very positive. Thus, IPUSD succeeds in identifying the most challenging examples. But training on these does not increase accuracy because this is an underlying uncertainty of the dataset. To test this, we finetuned a fully supervised RoBERTa LARGE model on a fine-grained sentiment analysis task with YELP (Zhang et al., 2015). This model achieves 47.7% accuracy on randomly selected examples vs. 39.2% with IPUSD. This suggests that IPUSD selects challenging examples (i.e., ones for which it is not clear which class they belong to), which are not helpful if there is underlying uncertainty in class distinctions.

Balancing desiderata in AL
In summary, IPUSD makes the assumption that discrimination between classes can be learned well. If that is not the case, then it can underperform.
Table 2 also illustrates a rather small improvement in MRPC, with a higher standard deviation than random. MRPC unlabeled data have a nonuniform distribution of 68:32 for the equivalent vs. non-equivalent class. As indicated by its label entropy score (2.0), IPUSD usually selects training sets with a distribution similar to the original, because of its clustering mechanism. However, IPUSD selected a training set with a 53:47 distribution in one of the five selections, very different from 68:32 and resulting in low accuracy (64.2, not shown). IPUSD's four other selections are close to 68:32 and have higher accuracy (71.2±1.3).
In summary, IPUSD makes the assumption that selected training sets have a label distribution similar to the overall distribution. If this assumption is not true, it can underperform.
Table 4 shows an ablation study that looks at MEAL's three main components: active learning, ensembling, and multiprompting. We see that, in addition to providing more stable results, MEAL increases overall performance by 2.3 and 2.0 points over default prompt-based finetuning for ALBERT (Lan et al., 2020) and RoBERTa LARGE (Liu et al., 2019). The AL module of MEAL, IPUSD, gives

Conclusion
We demonstrate two stability problems of few-shot classification with prompt-based finetuning: instability due to run variability and due to training data selection. We show that existing solutions for instability fail. We first propose finetuning a single model with multiple prompts. This results in better performance and less model space complexity than finetuning several models with single prompts. We then propose run ensemble techniques that improve stability and overall performance.
Our setup with reduced run variability allows us to explore training data selection for prompt-based finetuning in a sufficiently stable experimental setting. We compare a set of modified AL algorithms to reduce training data selection instability and improve overall performance. Our novel AL algorithm, inter-prompt uncertainty sampling with diversity (IPUSD), outperforms prior AL algorithms (and random selection) for both ALBERT and RoBERTa LARGE.
Apart from our algorithmic innovations for few-shot prompt-based learning, we hope that our study will support fairer comparison of algorithms and thereby help better track progress in NLP. We publicly share our code and data splits at https://github.com/akoksal/MEAL.

Figure 1: Multiprompt results with 32 examples for ALBERT on RTE and MRPC. Prompt-based finetuning has large variance depending on training data selection and random initialization. The accuracy difference can be up to 23.5 with different random seeds (RTE #3) and 13.7 with different training sets (RTE #1 vs #4).

Figure 2: Loss and validation accuracy surface visualizations for two RTE runs with the same training set. Left (training loss): the two models θ_s and θ_f have similar loss; they are both located in the upper right blue zero-loss triangle. Right (validation accuracy): the successful model θ_s performs much better than the failed model θ_f.
Figure 3: Our modified active learning pipeline for data selection is illustrated with an example sentence and two prompts for sentiment analysis. The PLM outputs several features in a zero-shot manner. AL selects a few-shot training set based on these output features.
lc(x_i) = Σ_{p∈P} max{P(y = l_j | x_{i,p}) : j = 1..L}

Contrastive AL (CAL) (Margatina et al., 2021) selects examples with the highest KL divergence between the example and its M nearest neighbors in the PLM contextual embedding space:

cal(x_i) = Σ_{m=1}^{M} Σ_{p∈P} KL(P(y | x_{m,p}) || P(y | x_{i,p}))

We investigate three desiderata in AL: diversity, representativeness, and label entropy. Diversity (Zhdanov, 2019) measures the redundancy/similarity of training examples by calculating the reciprocal of the average distance between unlabeled examples and their nearest training example. Representativeness (Ein-Dor et al., 2020) captures the well-known issue of selecting outlier examples in AL; it is calculated as the reciprocal of the average distance between the selected training examples and their k (k=10) nearest neighbors from the unlabeled examples. Label Entropy (Prabhu et al., 2019) is the KL divergence between the class distribution of the unlabeled data and that of the selected training examples.
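The three desiderata can be sketched as follows. Euclidean distance in the embedding space is our assumption, since the text does not specify the metric, and the function names are ours:

```python
import numpy as np

def diversity(unlabeled, selected):
    """Reciprocal of the mean distance from each unlabeled example to its
    nearest selected training example (Zhdanov, 2019); higher = better coverage."""
    d = np.sqrt(((unlabeled[:, None] - selected) ** 2).sum(-1))
    return 1.0 / d.min(axis=1).mean()

def representativeness(unlabeled, selected, k=10):
    """Reciprocal of the mean distance from each selected example to its
    k nearest unlabeled neighbors; low values flag outlier selections."""
    d = np.sqrt(((selected[:, None] - unlabeled) ** 2).sum(-1))
    return 1.0 / np.sort(d, axis=1)[:, :k].mean()

def label_entropy(pool_dist, selected_dist):
    """KL divergence between the unlabeled pool's class distribution and
    the selected training set's class distribution."""
    p, q = np.asarray(pool_dist), np.asarray(selected_dist)
    return float((p * np.log(p / q)).sum())

# toy 1-D embeddings: a spread selection covers the pool, a clumped one does not
pool = np.array([[0.0], [1.0], [2.0], [3.0]])
spread = np.array([[0.5], [2.5]])
clumped = np.array([[0.0], [0.2]])
```

A training set that matches the pool's label distribution scores a label entropy near zero; the further the two distributions diverge, the larger the score.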

Table 1: Comparing stability techniques for prompt-based finetuning with single and multiple prompts with ALBERT on randomly selected training sets. Multiprompt improves overall performance compared to single prompt. ENSEMBLE_pred improves stability while achieving higher performance for single prompt and multiprompt. Standard deviation is calculated across runs (trials for ENSEMBLE) and averaged over five random training sets.

Table 2: Comparison of active learning methods. Random, BADGE, and IPUSD (inter-prompt uncertainty sampling with diversity) are non-deterministic. We run these algorithms for five random seeds and then average accuracy and standard deviation (averaged across training sets with a single trial for non-deterministic algorithms). Best results are indicated in bold; results better than random are underlined.

Table 4: Ablation. Performance over the five tasks for ALBERT xxlarge-v2 and RoBERTa LARGE.
Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650-663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2021. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In ICLR.

Ameya Prabhu, Charles Dognin, and Maneesh Singh. 2019. Sampling bias in deep active classification: An empirical study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4058-4068, Hong Kong, China. Association for Computational Linguistics.

Guy Rotman and Roi Reichart. 2022. Multi-task active learning for pre-trained transformer-based models. Transactions of the Association for Computational Linguistics, 10:1209-1228.

Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 441-448, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255-269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339-2352, Online. Association for Computational Linguistics.