Prompt Consistency for Zero-Shot Task Generalization

One of the most impressive results of recent NLP history is the ability of pre-trained language models to solve new tasks in a zero-shot setting. To achieve this, NLP tasks are framed as natural language prompts, and the model generates a response indicating the predicted output. Nonetheless, performance in such settings often lags far behind its supervised counterpart, suggesting a large space for potential improvement. In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance. Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency, encouraging consistent predictions over this diverse set of prompts. Our method makes it possible to fine-tune the model either with extra unlabeled training data, or directly on test input at inference time in an unsupervised manner. In experiments, our approach outperforms the state-of-the-art zero-shot learner, T0 (Sanh et al., 2022), on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy. The gains are often attained with a small number of unlabeled examples.


Introduction
While the past decade has demonstrated that pretrained language models (PLMs) are powerful tools for improving generalization from training datasets to test datasets (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020), more recent work has shown that they can even perform zero-shot generalization to new tasks without any annotated examples (Brown et al., 2020; Wei et al., 2022; Sanh et al., 2022). These systems leverage natural language prompts that specify the task for the model and represent different tasks in a unified format (Liu et al., 2021b). Zero-shot task generalization suggests a path towards generic systems that perform a wide variety of NLP tasks with no annotated examples. However, while enticing conceptually, zero-shot performance often remains relatively low compared to systems trained using even a small amount of task-specific labeled data.
In this paper, we examine methods to make PLMs better zero-shot learners using unlabeled text. Our work is motivated by consistency training methods that regularize model predictions to be invariant to perturbation (e.g. noise or paraphrasing) of the input examples. Consistency training is widely used in the semi-supervised learning literature as an effective technique to utilize unannotated examples (Bachman et al., 2014; Sajjadi et al., 2016; Beyer et al., 2019; Xie et al., 2020a). It is often understood as a type of smoothness regularization or data augmentation (Xie et al., 2020a) and attains strong performance in semi-supervised learning.
Instead of example-level consistency, we propose to regularize prompt consistency, where a model is regularized to make the same prediction across a diverse set of synonymous task prompts. Prompt consistency regularization makes sense intuitively since PLMs should be robust across synonymous prompts, whereas model predictions are known to be empirically very sensitive to the wording of task prompts (Jiang et al., 2020).
Specifically, we design a pairwise distillation loss that encourages consistency between every pair of prompts (Figure 1). We refer to our method as swarm distillation; it has the advantage of being fully unsupervised, requiring only unannotated inputs. Notably, unannotated examples are often relatively easy to collect. Drafting several prompts for a task is also far cheaper than annotating labels for each example; in fact, well-designed prompts are already available for a wide range of NLP tasks (Bach et al., 2022).
[Figure 1: An example of the proposed approach on a sentiment classification task. We apply multiple synonymous prompts r^(i), r^(j) to the unlabeled example, sample a pseudo soft target ŷ ∼ q(y|x, r^(i)) from one prompt, and regularize the consistency of the predictions from different prompts through our swarm distillation loss, as detailed in Eq. 2.]

Previous work on example-level consistency regularization typically minimizes a consistency loss along with a supervised loss in a semi-supervised setting (Miyato et al., 2018; Xie et al., 2020a). Recently, Elazar et al. (2021) performed experiments optimizing a prompt consistency loss in the context of a relation prediction task, also incorporating a supervised version of the masked language model pretraining objective. In contrast, we (1) optimize a novel prompt consistency loss alone, making our approach completely unsupervised and agnostic to the model's pretraining objective, and (2) experiment on and demonstrate the practicality of such an approach for a broad variety of NLP tasks. Notably, this unsupervised setting poses additional learning challenges: without explicit supervision, the model may suffer from catastrophic forgetting and even exhibit a form of collapse where it always makes the same prediction for any input.
To address these issues, we adopt two simple strategies: (1) we utilize parameter-efficient tuning techniques (Houlsby et al., 2019; He et al., 2022) to update only a small number of extra parameters, naturally mitigating catastrophic forgetting by fixing the original PLM parameters; and (2) we propose an unsupervised criterion to select the model checkpoint before it falls into a collapsed local optimum.
In experiments, we build our method on top of a state-of-the-art zero-shot task learner, T0 (Sanh et al., 2022), and validate its performance on 11 datasets from 4 NLP tasks: natural language inference, coreference resolution, word sense disambiguation, and sentence completion. We perform experiments under two scenarios: (1) training the model with unlabeled training data; or (2) tuning the model with unlabeled test inputs directly. In both settings, we show that our swarm distillation method improves the accuracy of the 3B-parameter T0 model on 9 out of 11 datasets by up to 10.6 absolute points. We further scale the model size up to 11B parameters and demonstrate that our approach outperforms the 11B-parameter T0 model on 4 out of 4 datasets. Remarkably, our analysis implies that these gains are often possible with only tens of examples, suggesting a small computational overhead.

Prompt-based Zero-Shot Task Generalization
Given a task where the input is denoted as x ∈ X and the goal is to predict y ∈ Y, we focus on the zero-shot task generalization setting: we aim to feed a PLM with x to predict y, where the PLM is never trained on the specific task to be performed. Zero-shot task generalization goes beyond traditional dataset generalization, as the model must generalize to new functions f : X → Y as opposed to new input examples x. Recently, the development of prompting methods has advanced zero-shot task generalization by representing different tasks in a unified format (Liu et al., 2021b), and several prompt-based approaches have attained reasonable zero-shot performance (Brown et al., 2020; Sanh et al., 2022; Wei et al., 2022). A prompt r consists of an input template r_x, an output template r_y, and metadata to re-format the original x and y into a new prompt-formatted input and target, r_x(x) and r_y(y). For example, as shown in Figure 1, in a sentiment classification task where we must predict positive or negative sentiment of the text, the input includes the field Sentence and the target consists of the field Label. An input template could be "Does the following sentence have a positive or negative sentiment? {Sentence}", and the target template is "Choices[{label}]". Here Choices is the metadata, a list containing [Positive, Negative] to correspond to the numeric label ids. We note that such metadata is prompt-specific and can differ across prompts for the same task; for instance, in Figure 1 the Choices list of the last prompt on the bottom is [Good, Bad]. In prompt-based approaches, the PLM models the conditional probability q(y|x, r) through p_θ(r_y(y)|r_x(x)), where θ denotes the model parameters. In classification tasks where Y is a finite label set, q(y|x, r) is normalized over the possible labels at inference time to predict y:

q(y|x, r) = p_θ(r_y(y)|r_x(x)) / Σ_{y'∈Y} p_θ(r_y(y')|r_x(x)).    (1)

In generation tasks where Y is an infinite sequence space, the target template is typically instantiated as the target itself, i.e. p_θ(r_y(y)|r_x(x)) = p_θ(y|r_x(x)); the output can then be decoded directly through sequence decoding approaches. By designing such prompts for each task, all NLP tasks share the same data format, and models trained on one task may generalize to others.
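To make the normalization in Eq. 1 concrete, here is a minimal sketch (our own illustration, not the authors' code) that turns hypothetical per-label sequence log-probabilities log p_θ(r_y(y')|r_x(x)) into the distribution q(y|x, r) over a finite label set:

```python
import math

def normalize_over_labels(label_logprobs):
    """Eq. 1: normalize per-label sequence log-probabilities
    over the finite label set Y to obtain q(y | x, r)."""
    # Subtract the max log-prob for numerical stability.
    m = max(label_logprobs.values())
    unnorm = {y: math.exp(lp - m) for y, lp in label_logprobs.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

# Hypothetical scores for the sentiment example in Figure 1: the PLM
# scores the verbalized targets "Positive" and "Negative".
q = normalize_over_labels({"Positive": -1.2, "Negative": -2.9})
predicted = max(q, key=q.get)  # argmax over Y
```

In practice the log-probabilities come from scoring each verbalized target r_y(y') under the PLM; only the relative scores matter after normalization.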

Problem Definition
In this paper, we aim to exploit unannotated examples to improve prompt-based zero-shot task generalization. Formally, we are given an unlabeled dataset {x_1, x_2, ..., x_N} for the task of interest, and we assume the task has K different prompts {(r_x^(1), r_y^(1)), ..., (r_x^(K), r_y^(K))}. Our goal is to utilize these resources and adapt a PLM to predict r_y(y) conditioned on r_x(x). Unlabeled inputs are often available in practice; we consider two such scenarios in this paper.
First, we consider the case where unannotated examples from a non-test set are available. For many NLP tasks, the inputs are plain text such as reviews, documents, or questions, and can be easily collected (less so for tasks like natural language inference, where the inputs are paired hypotheses and premises that can be non-trivial to obtain automatically). In this paper, we test this setting by utilizing the inputs of the training dataset. This is similar to Schick and Schütze (2021a,b), who directly use the inputs of the training split as unlabeled resources to help few-shot learning.
Second, we consider the case where unannotated test inputs are available. This is almost always true for any task. We use the test split to mimic this setting. While the limited number of unlabeled examples could potentially limit the effectiveness of some unsupervised learning methods, we show in §4.4 that our method is effective even with tens to hundreds of unlabeled examples.
On the other hand, a diverse set of prompts is not exceedingly difficult to collect in practice: drafting prompts for each task is easier than annotating labels for many examples. In fact, community efforts have already produced a Public Pool of Prompts (P3) that contains thousands of prompts for hundreds of NLP datasets (Bach et al., 2022).

The Prompt Consistency Loss
Consistency regularization is a method that creates different views (e.g. paraphrases of text) of the input and regularizes the outputs to be close to each other; it has achieved significant success in semi-supervised learning (Clark et al., 2018; Xie et al., 2020a,b). While previous methods use an additional module to perturb each example and then optimize example-level consistency, we propose to optimize prompt-level consistency, which (1) is conceptually simple, and (2) can mitigate the fact that the predictions of PLMs are typically inconsistent across different prompts for the same task (Jiang et al., 2020; Elazar et al., 2021). Intuitively, we propose to regularize the predictions of different prompts for a given input to be close to each other, using a pairwise distillation loss to draw the predictions from one prompt closer to those from the other. Concretely, we randomly sample a few pairs of prompts and distill the pseudo target ŷ from one prompt r^(i) to the other prompt r^(j), as illustrated in Figure 1. The loss function is defined as:

L(θ) = E_{x∼p_d(x)} E_{r^(i), r^(j)∼p(r)} [ − Σ_{ŷ∈Y} q(ŷ|x, r^(i)) log p_θ(r_y^(j)(ŷ)|r_x^(j)(x)) ],    (2)

where p_d(x) is the empirical data distribution, p(r) is a uniform distribution over possible prompts, and q(y|x, r) is the conditional target distribution defined as in Eq. 1 but with a stop-gradient operator. We do not propagate gradients to q(y|x, r^(i)), following Miyato et al. (2018) and Xie et al. (2020a).
Stopping the gradient of one side of a pairwise consistency loss has also been shown to help mitigate the collapse issue where all inputs lead to the same predictions (Chen and He, 2021). Different from traditional distillation, which distills from a teacher model to a student model (Hinton et al., 2015), or previous consistency training, where a single teacher distills to several students (Clark et al., 2018; Xie et al., 2020a), we perform distillation among a swarm of prompts where each prompt is a teacher and a student at the same time; we therefore term our method swarm distillation. In our implementation, we approximate the expectation over the paired prompts (r^(i), r^(j)) with k randomly sampled pairs for training efficiency. Prompt consistency is related to example-level consistency when viewing different prompt-formatted inputs r_x^(i)(x) as separate views of the same example; thus our swarm distillation approach shares its spirit with previous work on example-level consistency training and can be understood similarly from the perspective of unsupervised data augmentation, smoothness regularization, or label propagation (Xie et al., 2020a). In this paper, we focus on classification tasks where Y is a finite label set, though Eq. 2 can be directly applied to sequence generation tasks as well via sequence distillation (Kim and Rush, 2016).
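The sampling-and-distillation step above can be sketched in a few lines of numpy (a toy illustration under our own naming; the real method backpropagates through a PLM fine-tuned with LoRA, and here the stop-gradient on the teacher side is implicit since the teacher distribution is just a constant array):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def swarm_distillation_loss(all_logits, k=4):
    """Approximate Eq. 2 with k randomly sampled prompt pairs.

    all_logits: (K, |Y|) array of the model's label logits for ONE
    unlabeled input under each of the K prompts. Each sampled prompt
    r^(i) acts as teacher (pseudo soft target, no gradient) for the
    student prompt r^(j) via a cross-entropy distillation term.
    """
    K = all_logits.shape[0]
    pairs = [rng.choice(K, size=2, replace=False) for _ in range(k)]
    loss = 0.0
    for i, j in pairs:
        teacher = softmax(all_logits[i])          # treated as a constant
        student_logp = np.log(softmax(all_logits[j]))
        loss += -(teacher * student_logp).sum()   # cross-entropy
    return loss / k

logits = rng.normal(size=(8, 2))   # e.g. 8 prompts, binary label set
loss = swarm_distillation_loss(logits)
```

When all prompts already agree on a confident prediction, the loss approaches the (near-zero) entropy of the shared distribution, which matches the intuition that consistency is what is being rewarded.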
Our approach differs from previous consistency training methods, which often combine an unsupervised consistency loss with a supervised loss in a semi-supervised setting (Miyato et al., 2018; Clark et al., 2018; Xie et al., 2020a). Elazar et al. (2021) try to improve prompt consistency for a relation filling task with a pairwise two-sided KL divergence loss, but they also optimize a supervised version of the original PLM objective that turns out to be important. In contrast, our approach minimizes the swarm distillation loss in Eq. 2 alone, and is therefore completely unsupervised and agnostic to the pretraining objective. However, this setting also poses challenges in learning, which we discuss next.

Training
Trained without explicit supervision, the PLM may forget what it learned during pretraining, since the unsupervised consistency loss is different from the pretraining objective. Also, we note that prompt consistency may be achieved with a trivial solution: if the predictions for every example and every prompt collapse to the same label, then maximal consistency among prompts is reached. To mitigate such catastrophic forgetting and collapse issues, we propose two techniques:

Parameter-efficient tuning: It has recently been observed that updating a small number of added parameters in a PLM can achieve performance comparable to tuning all the parameters (Houlsby et al., 2019; Li and Liang, 2021; Hu et al., 2022; He et al., 2022). Parameter-efficient tuning methods naturally mitigate catastrophic forgetting and collapse by fixing the original PLM parameters. Specifically, we use LoRA (Hu et al., 2022), a low-rank adaptation method for PLMs. As shown in Figure 2, LoRA learns a low-rank approximation of the pretrained matrix updates: given a pretrained weight matrix W ∈ R^{d×m}, LoRA learns to update it as W ← W + αBA, where B ∈ R^{d×b} and A ∈ R^{b×m} are low-rank matrices and α is a hyperparameter; only B and A are updated during training. b ≪ d is referred to as the bottleneck dimension. Following He et al. (2022), we apply LoRA to the feed-forward weight matrices of every layer in the pretrained transformer (Vaswani et al., 2017) model. In our preliminary experiments, we found that LoRA is less likely to suffer from collapse, yet on some datasets the model still collapses in the end even though it learns well in the middle. This motivates us to develop a criterion to select the model checkpoint before it falls into a collapsed local optimum, which we describe next.
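A toy numpy sketch of the LoRA update W ← W + αBA (shapes, the zero initialization of B, and the α value are our own illustrative assumptions; in practice W is a feed-forward matrix of the pretrained transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, b, alpha = 64, 64, 1, 4.0   # bottleneck dimension b = 1, as in our setup

W = rng.normal(size=(d, m))       # pretrained weight, kept frozen
B = np.zeros((d, b))              # trainable; zero init makes the
A = rng.normal(size=(b, m))       # adapted weight equal W at the start

def adapted_forward(x):
    """Forward pass through the low-rank-adapted weight W + alpha * B A."""
    return x @ (W + alpha * B @ A).T

x = rng.normal(size=(2, m))
# Before any training step, the adapter is a no-op because B is zero.
unchanged = np.allclose(adapted_forward(x), x @ W.T)

# Only B and A receive gradients: b*(d+m) trainable params vs d*m frozen.
trainable = B.size + A.size
```

With b = 1 and d = m = 64, the adapter adds only 128 trainable parameters per matrix against 4096 frozen ones, which is why freezing W mitigates forgetting so cheaply.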
Unsupervised model selection criterion: Our zero-shot setting has no labeled validation data for model selection, and the swarm distillation objective is not an ideal selection criterion since it is minimized at collapse. Therefore, we would like an unsupervised criterion that encourages consistency but simultaneously penalizes collapse. With that in mind, we focus on Fleiss' kappa (Fleiss, 1971), a commonly used metric to assess the reliability of agreement. In our setting, Fleiss' kappa expresses the extent to which the amount of agreement among prompts exceeds what would be expected if all prompts made their predictions according to the marginalized distribution of labels. This design computes a notion of "relative consistency" and naturally penalizes collapse. Formally, let n_ij be the number of prompts that predict the j-th label for the i-th example. There are NK predictions in total, where N is the number of examples and K is the number of prompts. Given an example x_i, the agreement probability p_i computes the normalized number of agreeing prompt pairs:

p_i = 1/(K(K−1)) Σ_j n_ij (n_ij − 1),    (3)

then the "absolute consistency" P is:

P = (1/N) Σ_i p_i.    (4)

P is maximized in the case of collapse. However, Fleiss' kappa also considers the marginalized distribution of labels: how likely are two prompts to be consistent if they make predictions randomly according to the marginalized label distribution? This chance probability Pe is:

Pe = Σ_j q_j^2,  where  q_j = 1/(NK) Σ_i n_ij    (5)

represents the marginalized distribution of labels, i.e. p(y = j). Pe is large when collapse happens and one label dominates the entire corpus. Finally, Fleiss' kappa is computed as:

κ = (P − Pe) / (1 − Pe),    (6)

where 1 − Pe gives the degree of consistency attainable above chance and P − Pe gives the degree of consistency actually achieved above chance. κ ranges from −1 to 1. Eq. 6 naturally penalizes collapse, and in our experiments we always observe a monotonic decrease of κ when collapse happens. Therefore, we select the model checkpoint after which κ monotonically decreases.
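Given the count matrix n_ij, the criterion is straightforward to compute; a small self-contained sketch (our own helper function, not the paper's code):

```python
def fleiss_kappa(n):
    """Fleiss' kappa from an N x J count matrix, where n[i][j] is the
    number of prompts predicting label j for example i (rows sum to K)."""
    N = len(n)
    K = sum(n[0])
    J = len(n[0])
    # Per-example agreement p_i (Eq. 3) and its average P (Eq. 4).
    p = [sum(c * (c - 1) for c in row) / (K * (K - 1)) for row in n]
    P = sum(p) / N
    # Chance agreement Pe from the marginal label distribution q_j (Eq. 5).
    q = [sum(row[j] for row in n) / (N * K) for j in range(J)]
    Pe = sum(qj * qj for qj in q)
    # Kappa (Eq. 6): consistency achieved above chance.
    return (P - Pe) / (1 - Pe)
```

For instance, three prompts that all agree but split their votes across both labels over the corpus (counts [[3, 0], [0, 3]]) give κ = 1, while in a fully collapsed run P and Pe both approach 1, driving κ down, which is exactly the behavior the selection criterion relies on.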
We emphasize that we perform validation on the data that the model is trained on and do not require an additional development dataset. We include ablation analyses for both the LoRA and model selection components in Appendix C, showing that they are important for the success of our method.

Experiments
Datasets: We evaluate our method on 11 datasets from 4 NLP tasks, including (1) natural language inference; (2) coreference resolution; (3) sentence completion; and (4) word sense disambiguation: WIC (Pilehvar and Camacho-Collados, 2019). We access them using Hugging Face Datasets (Lhoest et al., 2021), and most of them are from the SuperGLUE benchmark (Wang et al., 2019a). All of these datasets are classification-based, predicting a discrete label from a finite set. Each dataset has a diverse set of prompts provided by the Public Pool of Prompts (Sanh et al., 2022); the number of prompts ranges from 4 to 15. Please refer to Appendix A for detailed statistics of these datasets.
Setup: We build our method on top of the PLM T0 (Sanh et al., 2022). T0 is an adapted version of the pretrained T5 model (Raffel et al., 2020) that is continually trained on multiple tasks with supervised, prompt-formatted examples. T0 outperforms GPT3 (Brown et al., 2020) and demonstrates state-of-the-art performance in zero-shot task generalization. None of the tasks we study is included in T0's training data. We focus our main study on the T0 model version with 3 billion parameters (T0-3B), while we also include results using the largest T0 model with 11 billion parameters (T0-11B) on some datasets, due to the high computational cost of training T0-11B. The hyperparameters (e.g. the optimization hyperparameters) are tuned on the RTE dataset with its validation set and fixed for all other datasets. We use a bottleneck dimension of 1 for LoRA. Complete setup details can be found in Appendix B.

[Table 1 caption fragment: ... results are from tuning on the training split, and self distillation results are from tuning on the validation split. We report the mean and std across 3 random runs, and also denote the absolute accuracy change compared to the T0-3B baseline.]

Evaluation
Metrics: We use accuracy as the metric for all datasets. We report two different types of accuracy, given that we have multiple prompts. The ensemble accuracy (Ens.) averages the output distributions of multiple prompts and makes predictions according to the average. Ensembling multiple prompts has been explored before and found superior to using a single prompt (Jiang et al., 2020; Qin and Eisner, 2021). The median accuracy (Med.) within the set of prompts serves as a proxy for the expected performance when users specify a single prompt and input a prompt-formatted example. As our approach assumes the availability of a set of prompts for the downstream task, and it is relatively cheap to craft several prompts, ensemble prediction is the better option given an input x, and it empirically yields higher overall accuracy than the median for both the baseline and our method. Therefore, we report both numbers but mainly discuss ensemble accuracy. We report these metrics on the validation split of each dataset. We run the experiments with 3 random seeds and report the mean and standard deviation.
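The two metrics can be sketched directly from the per-prompt output distributions (a hypothetical numpy illustration with shapes of our own choosing):

```python
import numpy as np

def ensemble_and_median_accuracy(probs, labels):
    """probs: (K, N, J) array of per-prompt output distributions
    q(y | x, r^(k)) over J labels for N examples; labels: (N,) gold
    labels. Returns the ensemble accuracy (average the K distributions,
    then argmax) and the median of the K single-prompt accuracies."""
    ens_pred = probs.mean(axis=0).argmax(axis=-1)          # Ens.
    ens_acc = (ens_pred == labels).mean()
    per_prompt = (probs.argmax(axis=-1) == labels).mean(axis=1)
    return ens_acc, float(np.median(per_prompt))           # Med.

# Two prompts, two examples, binary labels.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.6, 0.4], [0.7, 0.3]]])
labels = np.array([0, 1])
ens, med = ensemble_and_median_accuracy(probs, labels)
```

Note that the ensemble can be worse than the best single prompt on a particular input (as in this toy case, where one prompt's confident error outweighs the other's correct vote), which is why both numbers are informative.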
Evaluation scenarios: We provide our method with different unlabeled sources, which lead to two practical scenarios during evaluation: (1) training-time tuning: we use the unlabeled training split from the corresponding dataset to train the model. This is similar to traditional settings where training and test data are different; and (2) test-time tuning (Sun et al., 2020; Wang et al., 2021): we directly adapt the PLM on the test data. This setting is reasonable, as we will always have access to the test inputs at test time. Intuitively, an unlabeled test sample x often provides hints about the distribution it was drawn from, suggesting that we may update the model before making the prediction. This scenario is attractive since it alleviates the common mismatch issue when there is a distribution shift between the training and test data. Compared to training-time tuning, test-time tuning typically uses less unlabeled data in our experiments since it uses the validation split itself. In the main experiments, we focus on offline test-time tuning, where we assume access to the entire test data and train our approach on all test examples, while in §4.4 we discuss the potential for online adaptation where data arrives in a stream.
Baselines: As far as we know, there is no prior work studying unsupervised approaches for this prompt-based task generalization setting, thus T0 is the main baseline that we compare our approach against. However, we also implement an ablation baseline, self distillation, to separate the improvement due to optimizing prompt consistency from that due to pseudo-label distillation. Specifically, self distillation minimizes the same loss as in Eq. 2 but with r^(i) = r^(j): instead of pairwise distillation, each prompt always distills its own prediction to itself. This baseline can be viewed as a prompt version of self-training, which has proven effective at utilizing unlabeled data (He et al., 2020; Xie et al., 2020b; Zhang et al., 2020). For simplicity, we report self distillation results in the training-time tuning setting only.

Results
How well does swarm distillation work? We first compare swarm distillation against the T0-3B baseline. As shown in Table 1, the ensemble accuracy of swarm distillation exceeds the T0-3B baseline on 9 out of 11 datasets in both the training- and test-time tuning settings. In particular, our approach improves the zero-shot performance on RTE by around 10 absolute points in all cases. Our approach slightly hurts the ensemble accuracy of ANLI R3 and the median accuracy of CB, but is overall comparable on these two datasets. Compared to self distillation, swarm distillation outperforms it on 9 out of 11 datasets in terms of ensemble accuracy, by up to 10.3 absolute points. These results further confirm the effectiveness of encouraging prompt consistency. We note that swarm distillation severely fails on WSC, with a 10-point accuracy decrease compared to both T0 and self distillation; this is because Fleiss' kappa selects a bad model checkpoint, while our approach actually improves the performance on WSC in the middle of training, as we discuss further in §4.4. Notably, our approach is helpful on several datasets where T0-3B only shows nearly chance accuracy.

Does swarm distillation improve prompt consistency? We next examine whether the gains of swarm distillation are attained together with more consistent predictions across different prompts. To this end, we report Fleiss' kappa, a commonly used metric for group agreement as detailed in §3.3. Results are shown in Table 4. Fleiss' kappa on 8 out of 11 datasets increases after swarm distillation, which boosts the averaged Fleiss' kappa of T0-3B by 14.6% relatively. This implies that swarm distillation facilitates prompt consistency, and potentially improves the robustness of PLMs to different wordings of prompts.
Does the unsupervised criterion select the best model checkpoint? In §3.3 we discussed using Fleiss' kappa to select the best model checkpoint for evaluation; here we report the oracle accuracy numbers obtained by selecting the model checkpoint with the best validation accuracy, and compare them to the ones selected by Fleiss' kappa. We compare the ensemble accuracy using T0-3B in the training-time tuning setting, with results in Figure 3. On most of the datasets, Fleiss' kappa is able to achieve numbers close to the best ones. On all 11 datasets, our oracle number outperforms the T0-3B baseline. Table 1 shows that swarm distillation hurts the performance on WSC substantially, while in Figure 3 swarm distillation (oracle) in fact outperforms T0-3B, implying that the issue lies in model selection. Therefore, swarm distillation could potentially work better if an annotated dev set is available, or when it is combined with other techniques in few-shot learning settings, where good checkpoints may be selected more easily.
How many prompts do we need? Our approach requires a diverse set of prompts to regularize prompt consistency. Here we perform ablation experiments to understand the effect of the number of prompts on performance. We take COPA and ANLI R2 as example datasets, which have 8 and 15 prompts respectively. We then vary the number of available prompts by randomly sampling a subset of prompts before training. We report the ensemble accuracy of swarm distillation (train) in Figure 4a. On both COPA and ANLI R2, we observe gains as we increase the number of prompts from 0 (the baseline), yet the performance saturates very quickly and relatively stabilizes once we provide 4 prompts. This implies that swarm distillation is not prompt-hungry and could work well with a small number of prompts. We note that with one prompt, Eq. 2 degenerates to a weaker version of the self distillation in Table 1: self distillation in Table 1 utilizes all prompts during training, while here we assume access to only one prompt.

How many unlabeled examples do we need?
We measure the effect of unlabeled data size. Specifically, we randomly sample a subset of examples from the train split for training and report results on the entire validation dataset. Results on WIC and ANLI R2 are shown in Figure 4b. Notably, swarm distillation outperforms the baselines (#examples=0) by a large margin on both datasets with only 10 unlabeled examples, and the performance starts to saturate quickly afterward. These results suggest that swarm distillation is not data-hungry and works reasonably well with few unlabeled examples, allowing it to remain a relatively lightweight approach, whereas typical unsupervised training (e.g., pretraining) often requires a large amount of data and computation. We also argue that this phenomenon suggests swarm distillation may be applicable to the online setting of test-time tuning, where batches of test data arrive in a stream. Online test-time tuning is a practical real-world setting, and we leave its study as future work.

Discussion
In this paper, we explore prompt consistency regularization to make PLMs better zero-shot learners. Our approach, swarm distillation, utilizes unlabeled examples to attain zero-shot gains. While we use swarm distillation in a post-adaptation manner, it could potentially be combined with pretraining objectives during the pretraining stage (e.g., the multi-prompt training loss (Sanh et al., 2022; Wei et al., 2022)), or even with annotated data in few-shot learning settings. Combining swarm distillation with these other losses may easily bypass the model collapse issue, since the other loss typically discourages the collapsed local optimum.

Limitations
There are two limitations of our work: (1) Because our method operates in a fully unsupervised manner, there is no supervised development data for us to either select the best model or tune hyperparameters. Thus, we propose to use Fleiss' kappa as our unsupervised development metric for model selection, which attains decent performance in most cases. However, on a very few datasets the proposed metric fails to select the best checkpoints and hurts the model's performance. As discussed in §4.4, our method can be combined with few-shot learning where a small amount of labeled data is provided, and we believe this can largely alleviate the issues of model selection in the unsupervised setting. (2) The other limitation, and at the same time an advantage, of our method is that it can work well even with 10 unlabeled data points. This certainly makes our method a good candidate for the online setting where batches of test data come in a stream. However, as we discussed in §4.4, the performance of our model saturates quickly as we increase the number of unlabeled data points, which means our method cannot scale with large amounts of unlabeled data the way self-supervised pretraining does. As discussed in §5, we expect that combining our method with few-shot learning or pretraining can lead to further improvements, as the supervised signals may guide the model to a better local optimum.

Ethics Statement
Similar to T0, this work aims to produce an open-ended system that could perform all text-based tasks through designing different prompts. While the performance of GPT-3, T0, and this work is far from a practical level on unseen tasks, we expect that greatly improved prompt-based systems could be built in the future to help perform many daily tasks in real life. However, the resulting model in this paper also admits the same ethical concerns that T0 has. For example, the unrestricted use of prompts may easily trigger offensive generations or private information leakage, and how to fix unwanted LM behaviors is still an active research problem (Liu et al., 2021a; Perez et al., 2022).

A Datasets
We present the statistics of the 11 datasets in Table 5. For the training-time tuning scenario, if the train set contains more than 10,000 data points, we use up to 10,000 of them for training.

B Experimental Setup
B.1 LoRA Setup
We use LoRA (Hu et al., 2022) as our parameter-efficient tuning method and set the bottleneck dimension of the LoRA weight matrices to 1 for both the 3B and 11B models. We emphasize that the linear mapping matrix B (or A) in LoRA needs to be initialized as a zero matrix to ensure that the output distribution after adding LoRA layers is the same as that of the original PLM before training; otherwise, the zero-shot ability of PLMs would be broken upon initialization and there is no supervision to learn it back. For both models, we set the dropout probability for the LoRA intermediate representations to 0.3. Let α denote the scaling factor of LoRA, which scales the output of the LoRA layer before it is added to the hidden states of the pre-trained model. We set α to 4 and 2 for the 3B and 11B models, respectively. The peak learning rates of the 3B and 11B models are set to 3e-5 and 5e-5 respectively, with a warm-up stage of 100 steps and a polynomial learning rate scheduler. We train for a maximum of 1,500 steps. Note that the hyperparameters for the 3B model are tuned on the RTE dataset and reused for the other datasets. We did not tune the hyperparameters of the 11B model.
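The zero-initialization property above can be illustrated with a minimal sketch using plain Python lists instead of the actual PyTorch modules; `lora_forward` and its argument names are our own, not the paper's implementation:

```python
def lora_forward(x, W, A, B, alpha):
    """Compute x @ W + alpha * ((x @ A) @ B): a frozen linear map W plus
    a low-rank LoRA update with matrices A (d_in x b) and B (b x d_out).
    When B is initialized to all zeros, the LoRA branch contributes
    nothing, so the initial output matches the frozen layer exactly."""
    def matvec(v, M):
        # v @ M, where M is stored as a list of rows
        return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

    base = matvec(x, W)                 # frozen pre-trained path
    delta = matvec(matvec(x, A), B)     # LoRA path through bottleneck b
    return [y + alpha * d for y, d in zip(base, delta)]
```

With B = 0 the output equals the base layer's output, which is exactly why the zero-shot behavior of the PLM is preserved at the start of tuning; a nonzero B shifts the output by the scaled low-rank update.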

B.2 Implementation Details
The reported T0 baseline numbers are obtained from our own runs using the released T0 weights. We are able to reproduce the numbers reported in Sanh et al. (2022), except for COPA, where our T0 median number is higher than the originally reported one.
During training, at each update we first sample one input example x and apply all the prompts to reformat it as r_x^1(x), ..., r_x^K(x). We then perform inference for them and randomly shuffle the predictions. Next, we iterate over them with a batch size of 5/10 (3B/11B) and use the shuffled predictions as supervision to compute the distillation loss; this implements the swarm distillation mechanism in Eq. 2 and amounts to approximating the expectation over paired prompts with K random pairs. We accumulate the gradients over 16 steps per update, so that each gradient descent step is computed from 16 data examples. We use one A40 GPU (45GB memory) to train the 3B model and four A40 GPUs with DeepSpeed ZeRO-2 (Ren et al., 2021) to train the 11B model. In general, training converges quickly and takes around 1-3 GPU hours for the 3B model and 2-6 hours for the 11B model, depending on the early stopping points of different datasets. We use Adam (Kingma and Ba, 2015) as the optimizer with β1 = 0.9, β2 = 0.98, and ε = 1e-6.
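The prediction-shuffling step above can be sketched in a few lines. The function and variable names here are ours; the real training step would feed each resulting pair into a distillation loss and accumulate gradients, rather than just returning the pairs:

```python
import random

def swarm_pairs(prompted_inputs, predictions, seed=0):
    """Sketch of the pairing step in swarm distillation (Eq. 2): each
    prompted version of an example is supervised by the prediction
    produced under a (randomly shuffled) prompt, approximating the
    expectation over prompt pairs with K random pairs."""
    rng = random.Random(seed)
    targets = list(predictions)
    rng.shuffle(targets)  # random prompt-to-target assignment
    return list(zip(prompted_inputs, targets))
```

Every prompt receives exactly one target and every prediction is used exactly once, so no single prompt's output dominates the distillation signal.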

C Ablation on LoRA and Model Selection
We report the ablation results on LoRA and unsupervised model selection in Table 6. Full fine-tuning hurts the T0-3B performance on all datasets; in fact, when we inspect its predictions, it collapses on almost all of the datasets, which could partially explain the low accuracies. Using LoRA alone generally improves over full fine-tuning and sometimes outperforms the T0-3B baseline. Moreover, we find that unsupervised model selection is very effective at mitigating collapse and greatly improves full fine-tuning results. Finally, combining LoRA with unsupervised model selection gives the best overall results.

Figure 2: A diagram of LoRA in the FFN sublayer. Only the LoRA parameters, A and B, are updated during training.

Figure 3: Analysis results comparing the model checkpoints selected by the unsupervised criterion, Fleiss' kappa, with the oracle model checkpoints selected by validation accuracy.

Figure 4: Ensemble accuracy of swarm distillation on three example datasets based on T0-3B, demonstrating the effect of prompt size and unlabeled data size.
For Transformer (Vaswani et al., 2017) models with model dimension d, feed-forward intermediate dimension m, and number of layers l, the number of additional parameters used by LoRA with bottleneck dimension b is b * (m + d) * 2 * l * 2. As we set b to 1 for both the 3B and 11B models, the additional number of LoRA parameters is 1,671,168 for the T0-3B model (d = 1024, m = 16384, l = 24) and 6,389,760 for the T0-11B model (d = 1024, m = 65536, l = 24).
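The parameter counts above can be checked directly from the formula; the helper name and the factor breakdown in the comment reflect one consistent reading of the formula (two adapted FFN matrices per layer, times encoder and decoder stacks):

```python
def lora_param_count(d, m, l, b=1):
    """Extra LoRA parameters: each adapted FFN matrix gets A (in_dim x b)
    and B (b x out_dim), i.e. b * (m + d) parameters either way; there
    are two FFN matrices per layer, l layers, and a factor of 2 for the
    encoder and decoder stacks."""
    return b * (m + d) * 2 * l * 2
```

Plugging in the T0-3B and T0-11B shapes reproduces the 1,671,168 and 6,389,760 figures quoted above.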

Table 1: Accuracy results on the validation set of 11 NLP datasets based on the T0-3B model. Swarm Distillation (train) and Swarm Distillation (test) use the unlabeled training split and validation split of the datasets to train the model, respectively, corresponding to training-time and test-time tuning. The Story Cloze dataset does not have a training split, so only test-time tuning results are reported for it.

Table 3: Ensemble accuracy in the distribution shift setting based on T0-3B. "No Shift" represents the original setting, where the swarm distillation loss is trained on the training split of the same dataset as the test examples. "SD on MNLI/QNLI" represents swarm distillation trained on the training split of MNLI/QNLI.

Table 4: Fleiss' kappa on 11 datasets based on T0-3B. Swarm distillation is trained on the training split of the respective dataset.

Table 5: Statistics of the datasets.