Reordering Examples Helps during Priming-based Few-Shot Learning

The ability to learn from limited data, or few-shot learning, is a desirable and often critical requirement for NLP systems. While many existing methods do poorly at learning from a handful of examples, large pretrained language models have recently been shown to be efficient few-shot learners. One approach to few-shot learning, which does not require finetuning of model parameters, is to augment the language model's input with priming text which is typically constructed using task specific descriptions and examples. In this work, we further explore priming-based few-shot learning, with focus on using examples as prompts. We show that presenting examples in the right order is key for generalization. We introduce PERO (Prompting with Examples in the Right Order), where we formulate few-shot learning as search over the set of permutations of the training examples. We show that PERO can learn to generalize efficiently using as few as 10 examples, in contrast to existing approaches. While the newline token is a natural choice for separating the examples in the prompt, we show that learning a new separator token can potentially provide further gains in performance. We demonstrate the effectiveness of the proposed method on the tasks of sentiment classification, natural language inference and fact retrieval. Finally, we analyze the learned prompts to reveal novel insights, including the idea that two training examples in the right order alone can provide competitive performance for sentiment classification and natural language inference.


Introduction
The ability to learn from a few examples, or few-shot learning, as generally understood to be possessed by humans, is a desirable property for Natural Language Processing (NLP) systems as well. It is critical in scenarios where collecting large amounts of data is expensive. It is also important to enable a personalized Artificial Intelligence (AI) experience, where a single user is expected to use an AI agent to perform a task demonstrated through a handful of examples. Pretrained language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) have recently been shown to be exceedingly good at several benchmark NLP tasks (Wang et al., 2018, 2019). Traditionally, the parameters of these language models have been finetuned on task-specific datasets to achieve the aforementioned performance gains, often requiring large amounts of data. Brown et al. (2020) show that large pretrained language models (GPT3) are also efficient few-shot learners. Few-shot learning is achieved using task descriptions and labeled examples as prompts. Remarkably, with this priming-based approach and without needing any parameter updates, GPT3 often performs comparably to traditional finetuning-based supervised systems which use much larger datasets. One could argue that the task performance achieved in the priming-based approach measures what the pretrained language model has already learned. Shin et al. (2020), operating in the same setting, use automatically generated prompts to measure task specific knowledge in a pretrained language model.
In this work, we further explore priming-based few-shot learning, while focusing on using examples as prompts. The training objective for a language model is typically the prediction of a token given a context. There is no clear incentive to treat a sequence of sentences in the context as equal and conveying examples of a concept. As a result, one could expect certain orders of examples, when used as a prompt, to be more favorable at providing task-specific cues. We propose PERO (Prompting with Examples in the Right Order), where we formulate the problem of few-shot learning as search over permutations of training examples. We find that choosing the right permutation is key to getting good task performance. In PERO, we use a genetic algorithm (Mitchell, 1998) to search over permutations of the training examples, evaluating each permutation using publicly available pretrained masked language models (Devlin et al., 2019; Liu et al., 2019). We find that with as few as 10 examples, PERO can learn to generalize efficiently, in contrast to existing approaches. When concatenating examples to use as a prompt, the newline token is a natural choice as a separator token. We show that using a learned separator token can potentially provide further gains in performance. We evaluate the performance of PERO on the tasks of sentiment analysis, Natural Language Inference (NLI) and fact retrieval.

Figure 1: Briefly, starting from a set of randomly initialized permutations, the genetic algorithm step computes the fitness of each permutation for making predictions using a pretrained LM. These fitness scores are then used for selection and subsequent breeding of new permutations using the biologically inspired operations of mutation and crossover. The separator token learning step uses the updated set of permutations and uses gradient updates to improve the separator token. The two steps are performed iteratively for a fixed number of epochs, and the best permutation and separator token are selected using a validation set. Please see Section 4 for details.
Finally, our analysis of the learned prompts (Section 5.5) leads to novel insights about few-shot learning using textual prompts. For instance, using only two examples, repeated and ordered using a learned label pattern, can provide performance comparable to, and even exceeding, the performance of existing few-shot baselines which use thousands of examples. (PERO source code is available at https://github.com/SawanKumar28/pero.)
In summary, we make the following contributions: 1. We propose PERO, where we formulate the problem of few-shot learning as search over permutations of training examples, and optionally a separator token. As we don't update the parameters of the underlying language model, PERO serves as a probe for measuring task specific knowledge in pretrained language models.
2. We demonstrate the effectiveness of PERO over a recent baseline on the tasks of sentiment analysis, NLI and fact retrieval.
3. We analyze the learned prompts and provide novel insights about textual prompts that can lead to good task performance in the low-data regime. In particular, we provide an effective recipe for one-shot learning.
We have released the source code of PERO to aid reproducibility of the results.

Table 1: Formatting used to create textual inputs for the tasks considered in this work. For sentiment classification, positive and negative sentiments correspond to the label text of true and false respectively. For NLI, entailment and contradiction labels correspond to the label text of true and false respectively.

NLI: (1) "Men are sawing logs" implies "Men are cutting wood" Answer: true (2) "There is no girl in white dancing" implies "A girl in white is dancing" Answer: false
Fact retrieval (e.g., [Subj] is a subclass of [Obj]): (1) Directors Lounge is located in Berlin (2) gingerbread is a subclass of cookie

Related Work
Pretrained language models, trained using a transformer architecture (Vaswani et al., 2017) on large unsupervised corpora, have recently been found to be efficient at learning downstream tasks, providing significant gains over existing standalone supervised systems on a variety of NLP tasks (Wang et al., 2018, 2019). There have been two major approaches to learning language models: causal language models (CLM) and masked language models (MLM). CLMs (Radford et al., 2018, 2019; Brown et al., 2020) are typically trained by requiring a language model to predict the next token given a textual context. Masked language models (Devlin et al., 2019; Liu et al., 2019), on the other hand, are trained by masking out a certain number of tokens in a textual context and requiring the language model to predict the masked-out tokens. Typically, the parameters of the language model are then finetuned using task-specific training examples. For our experiments, we leverage publicly available pretrained masked language models (Devlin et al., 2019; Liu et al., 2019). Few-shot learning using language models is a desirable and perhaps even an expected property of large pretrained language models, given the large amounts of data they are typically trained with. Brown et al. (2020) show that scaling up language models leads to improved few-shot learning, with their best model, GPT3, being able to achieve performance comparable to existing supervised systems, while using much fewer examples. Zero-shot and few-shot learning are achieved without needing parameter updates to the model, but instead by prompting the language model with a task-specific description and task-specific examples. In this work, we study the impact of the order in which examples are presented in a prompt, and show that searching over orders can lead to significant gains in few-shot performance, without needing updates to the model parameters.
Measuring task performance of language models without any parameter updates can be seen as a measure of the knowledge (either descriptive or procedural) that is already contained in the pretrained language model.
Probing knowledge contained in language models has been of interest, given the success of these models. Probing methods rely on creating cloze-style manual prompts (Petroni et al., 2019), or on mining efficient natural language prompts (Jiang et al., 2020). Shin et al. (2020) rely on training examples to learn trigger tokens which, when used as a prompt, demonstrate the ability of language models to do sentiment analysis and NLI along with knowledge base completion, without needing any parameter updates. The learned trigger tokens, however, are not very meaningful, leading to difficulty in interpreting these results. In this work, we instead focus on using natural language training examples as prompts. While being more interpretable, the prompts used in this work also lead to significant gains in performance in the low-data regime.

Background: Genetic Algorithm
A genetic algorithm (Mitchell, 1998) is a search heuristic inspired by the biological process of natural selection. Briefly, it evolves a population of candidate solutions towards increasing fitness on an objective, through biologically inspired operations such as selection, crossover and mutation. We now describe the key terminology: Individual A single candidate solution, c, usually represented through a binary code but extensible to other types of codes. Generally, we will let c be denoted by a sequence of k integers, c = (s_1 s_2 ... s_k).
Fitness The measure of goodness of an individual c_i for the task, F(c_i).
Selection An operator to select fit individuals in a population which will be used to generate new individuals, through crossover and mutation. Better fitness leads to higher likelihood of selection.
Crossover An operator which typically takes two individuals c_1 and c_2 as inputs and produces new individuals d_1 and d_2 by combining subsequences from the two inputs. For example, consider two input sequences c_1 = (a_1 a_2 a_3 a_4) and c_2 = (b_1 b_2 b_3 b_4). A single-point crossover after the second position would lead to the individuals d_1 = (a_1 a_2 b_3 b_4) and d_2 = (b_1 b_2 a_3 a_4). Mutation An operator which randomly flips some elements of an input sequence. For example, with input c = (c(1) c(2) c(3) c(4)), a typical mutation operation might lead to the output (c(1) c'(2) c(3) c(4)), where c'(2) is a randomly sampled replacement. Usually, each position is randomly altered with a mutation probability p_m.
We now present the sketch of a typical genetic algorithm:

1. Initialize a set of individuals to form a population P. Repeat the following steps for N_epochs iterations.

2. Compute the fitness of each individual in the population P.

3. Using the computed fitness, select individuals which will be used to breed the next generation.

4. With pairs of selected individuals, generate new individuals using the crossover operation.

5. Mutate the generated individuals using the mutation operator, to create a new population P'.

6. Set P = P' and go to step 2.
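The sketch above can be written directly in code. The function below is illustrative only (the names and defaults are ours, not from the released PERO code); the fitness, crossover, and mutation operators are passed in as task-specific callables, and fitness is treated as higher-is-better.

```python
import random

def genetic_algorithm(init_population, fitness_fn, crossover_fn, mutate_fn,
                      n_epochs, elite_ratio=0.1, selection_size=25):
    """Generic GA loop: elitist selection, then crossover and mutation."""
    population = list(init_population)
    for _ in range(n_epochs):
        # Step 2: compute fitness of each individual (higher is better here).
        scored = sorted(population, key=fitness_fn, reverse=True)
        n_elite = max(1, int(elite_ratio * len(scored)))
        # Step 3: elitism -- carry the top individuals over unchanged.
        next_population = scored[:n_elite]
        parents = scored[:selection_size]
        # Steps 4-5: breed the rest of the population.
        while len(next_population) < len(population):
            p1, p2 = random.sample(parents, 2)
            for child in crossover_fn(p1, p2):
                next_population.append(mutate_fn(child))
        # Step 6: replace the population and repeat.
        population = next_population[:len(population)]
    return max(population, key=fitness_fn)
```

Because the elites are carried over unchanged, the best fitness in the population is non-decreasing across epochs.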

PERO: Proposed Method
The overall architecture employed in PERO is shown in Figure 1. We introduce the notation in Section 4.1. We discuss how we employ a genetic algorithm to search over permutations of training examples in Section 4.2. We then discuss how we augment the search heuristic to learn a task specific separator token in Section 4.3.

Notation and Input Format
For both classification and knowledge base completion tasks, we denote a textual task input by x and the gold label by y. We denote the pretrained masked language model by the operator L, which takes a sequence of input tokens and outputs a sequence of the same length containing token probabilities over the token vocabulary: with input tokens t_1, ..., t_n, L(t_1 ... t_n) = (p_1, ..., p_n), where p_i denotes a vector of probabilities over all tokens in the vocabulary. For all our experiments, the input to the language model is formatted with exactly one mask token. For brevity, we denote by L_Mask the operator which outputs the token probability at the mask token position.
The training data is denoted by the set of examples {(x_i, y_i)}, i = 1, ..., N. We denote a permutation, or an ordered subset of size k of the training data, by c = (c(1) c(2) ... c(k)), where each c(i) indexes into the training data. For all tasks, we create an input text sequence by concatenating k examples using a permutation c of training examples, along with a test example x_test:

"Format(x_c(1), y_c(1)) <Separator> Format(x_c(2), y_c(2)) ... <Separator> Format(x_c(k), y_c(k)) <Separator> Format(x_test, mask)",

where Format(.,.) formats the example text and label for a task, and <Separator> is either the newline character, or is learned as described in Section 4.3. The formatting details are provided in Table 1. We attempt to use task-agnostic formats and textual labels for classification tasks to the extent possible.
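As a concrete sketch, the prompt construction can be written as follows. Here `format_example` is a stand-in for the task-specific Format(.,.) operator (the NLI-style template is only one of the formats in Table 1), and the mask token string depends on the tokenizer in use.

```python
def format_example(x, y):
    # Stand-in for Format(x, y); illustrative NLI-style template (cf. Table 1).
    premise, hypothesis = x
    return f'"{premise}" implies "{hypothesis}" Answer: {y}'

def build_prompt(train_examples, permutation, x_test, separator="\n",
                 mask_token="[MASK]"):
    """Concatenate training examples in the order given by `permutation`,
    then append the test example with its label replaced by the mask token."""
    parts = [format_example(*train_examples[i]) for i in permutation]
    parts.append(format_example(x_test, mask_token))
    return separator.join(parts)
```

The prompt is then fed to the masked language model, and the prediction is read off the distribution at the single mask position.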

Genetic Algorithm: Search over Permutations of Examples
We employ a genetic algorithm for searching over permutations of training examples (see Section 3 for a brief introduction to genetic algorithms). We present the overall architecture in Figure 1.
Here, we detail how the various components and operators of a genetic algorithm are defined for searching over permutations of examples: Population A set of individuals.
Fitness For a given permutation of training example indices, fitness is defined as the average cross-entropy loss over training examples when evaluated as in Figure 1. The cross-entropy loss is computed over the set of possible labels for classification tasks, and over all tokens in the vocabulary for knowledge base completion tasks.
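A minimal sketch of this fitness computation is given below, with the pretrained LM abstracted behind a callable that returns label probabilities at the mask position; this interface is ours, for illustration only.

```python
import math

def fitness(permutation, train_examples, label_probs_fn):
    """Average cross-entropy of the gold label at the mask position, where
    each training example in turn plays the role of the test example and the
    permutation supplies the prompt. `label_probs_fn(prompt_indices, test_index)`
    stands in for the pretrained LM and returns a dict label -> probability.
    Lower is better (the search favors permutations with low loss)."""
    total = 0.0
    for i, (_, y) in enumerate(train_examples):
        probs = label_probs_fn(list(permutation), i)
        total += -math.log(max(probs[y], 1e-12))  # clamp to avoid log(0)
    return total / len(train_examples)
```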
Note that during search, a training example may occur both in the prompt and as the test example. This is generally not a problem, as we are not finetuning the model and do not run the risk of learning to copy. When also training the separator token (Section 4.3), we ensure that the test example does not occur in the prompt, by dropping it from the prompt if required.
Selection For selection, we use elitism, i.e., at each generation of individuals, we retain a certain percentage (elite ratio) of top performing individuals without any modifications. The rest of the population is created through crossover and mutation over a percentage (selection size) of top performing individuals.
Crossover We perform a single-point crossover, while ensuring that the resulting individuals contain unique indices. Given two parents c_1 and c_2, first a random number j is sampled in the range [k], the length of the individuals, to use as the crossover point. We define an operator First_s(v, v'), which selects the first s elements in vector v which do not occur in vector v'. Similarly, Last_s(v, v') picks the last s elements in v which do not occur in vector v'. Denoting the subvector c(i) c(i+1) ... c(j) by c_{i:j}, four new individuals are then created:

d_1 = (c1_{1:j} First_{k-j}(c_2, c1_{1:j})), d_2 = (c2_{1:j} First_{k-j}(c_1, c2_{1:j})), d_3 = (Last_j(c_2, c1_{j+1:k}) c1_{j+1:k}), d_4 = (Last_j(c_1, c2_{j+1:k}) c2_{j+1:k}).

This modification over a straightforward crossover ensures that the resulting individuals contain unique indices.
Mutation We perform mutation on an input candidate by changing each position with a mutation probability p m . When changed, an index is replaced by a random choice from the other training examples. If the new index is already present in the input candidate, the value at that index is swapped with the selected index.
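The uniqueness-preserving crossover and the swap-based mutation described above can be sketched as follows. The helper names mirror the First/Last operators; the code is illustrative rather than the released implementation.

```python
import random

def first_s(v, exclude, s):
    """First_s(v, v'): the first s elements of v that do not occur in v'."""
    return [e for e in v if e not in exclude][:s]

def last_s(v, exclude, s):
    """Last_s(v, v'): the last s elements of v that do not occur in v'."""
    return [e for e in v if e not in exclude][-s:]

def crossover(c1, c2):
    """Single-point crossover that keeps each child's indices unique."""
    k = len(c1)
    j = random.randrange(1, k)  # crossover point
    return [
        c1[:j] + first_s(c2, c1[:j], k - j),
        c2[:j] + first_s(c1, c2[:j], k - j),
        last_s(c2, c1[j:], j) + c1[j:],
        last_s(c1, c2[j:], j) + c2[j:],
    ]

def mutate(c, n_train, p_m=0.1):
    """Replace each position with a random training index with prob. p_m,
    swapping when the new index is already present in the candidate."""
    c = list(c)
    for pos in range(len(c)):
        if random.random() < p_m:
            new = random.randrange(n_train)
            if new in c:
                idx = c.index(new)
                c[pos], c[idx] = c[idx], c[pos]
            else:
                c[pos] = new
    return c
```

Both operators return candidates with k unique indices, so the population always consists of valid permutations of training-example subsets.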
The genetic algorithm is run for N_epochs epochs (see Section 3 for the training flow). A validation set of the same size as the training set is used to select from the best-performing individuals in each epoch.

Separator Token Learning
In addition to the search over permutations of training examples as described in the previous section, we optionally learn a separator token to concatenate the examples (see Figure 1).
We initialize a token embedding parameter with the token embedding of the newline character. At the end of each epoch of the genetic algorithm, we use gradient updates to estimate the token embedding. The training set is created using the individuals (prompts) in the population in the current generation, and replacing the answer of the final example with the mask token. Gradient updates are then done by requiring the model to predict the correct answer.
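To illustrate the idea, the sketch below caricatures the frozen LM as a fixed linear map from the separator embedding to answer logits, and updates only the embedding by gradient descent on the cross-entropy of the correct answer. The paper instead backpropagates through the actual LM and uses AdamW; all names here are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def learn_separator(sep_init, W, target, lr=0.1, n_steps=50):
    """Toy sketch of separator-token learning: W is a fixed stand-in for the
    frozen LM, mapping the separator embedding to logits over answers; only
    the embedding is updated."""
    e = np.asarray(sep_init, dtype=float).copy()  # e.g. the newline token's embedding
    for _ in range(n_steps):
        p = softmax(W @ e)
        grad_logits = p.copy()
        grad_logits[target] -= 1.0        # d(-log p[target]) / d logits
        e -= lr * (W.T @ grad_logits)     # chain rule through the linear map
    return e
```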

Experiments
In this section, we aim to answer the following questions: Q3 What aspects of PERO are important for getting good performance? (Section 5.5) The experimental setup is described in Section 5.2, and the datasets are described in Section 5.1.

Table 2: Fact retrieval results (Precision@1): (Shin et al., 2020) 18.9; PERO 40.3.

Datasets
Sentiment Classification: We use SST-2 (Socher et al., 2013), a binary sentiment classification task. The dataset contains 67,350 training, 873 validation and 1,822 test examples.
Fact Retrieval: We use the train, validation, and test splits created by Shin et al. (2020) (referred to as 'original' in the paper) for 41 relations. For our experiments, we use the manual prompts created by Petroni et al. (2019). Please see Appendix A.2.2 for relation wise prompts and training statistics.

Experimental Setup
Number of training examples: For most of our experiments, we limit to a total of 10 training examples. We chose this number as prior work (Shin et al., 2020) faced difficulty in enabling predictions using only 10 training examples, usually performing close to random prediction. We create 5 sets of size 10, chosen successively from the first 50 training examples, and report on average task performance. Although our focus is few-shot learning in the low data regime, we also present results with more examples (the first 100 and the first 1000 examples) for reference. For model selection, we use a label-balanced validation set (chosen from the beginning of the corresponding validation set) of the same size as the training data. In all cases, and irrespective of the number of training examples, we keep the prompt size fixed to 10 examples.
Pretrained LM: We use RoBERTa-large (Liu et al., 2019) for all our experiments except for the fact retrieval task where we use the bert-large-cased model (Devlin et al., 2019) as this model has been shown to work better for the task (Shin et al., 2020). RoBERTa-large has 24 layers, with 16 attention heads and a hidden size of 1024 (355M parameters). Bert-large-cased uses the same architecture as RoBERTa-large. We use the implementation of transformer architectures provided by Wolf et al. (2020). We use the </s> token as the default separator token. When learning a new separator token, we initialize the token embedding by the token embedding of </s> token, and finetune the embedding as discussed in Section 4.3.
Genetic algorithm: We run the genetic algorithm for 100 epochs for classification tasks and 30 epochs for fact retrieval tasks. The population size was fixed to 100 and the mutation probability was set to 0.1. The elite ratio was set to 0.1, while the selection size was fixed to 25. When training a separator token embedding, the maximum number of training epochs for learning the embedding was set to 10 for classification tasks and 5 for fact retrieval tasks. Gradient updates were performed using the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 1e-4.
Baselines: We use Autoprompt (Shin et al., 2020) and the traditional finetuning approach as few-shot baselines. Please see Appendix A.1 for hyperparameter details.

Overall Results
In this section, we present the few-shot learning capability of PERO. For reference, we also report results when using more data.
We present fact retrieval results (Precision@1 scores) in Table 2, with PERO improving over the baseline on all relations. Overall, we show through PERO that simple manual prompts can be combined in relatively straightforward ways to create stronger probes while still being interpretable. We present the label accuracies of PERO for sentiment classification and NLI in Table 3. In each case, PERO is able to generalize well when using only 10 examples, while existing approaches perform close to random guessing (~50%). When using more data, PERO is competitive with Autoprompt for both tasks, while finetuning does better than PERO for NLI with larger training sizes. Overall, PERO provides an efficient approach to few-shot learning with pretrained language models.

Comparison when using larger training sizes:
The results in Table 3 also suggest the use of finetuning when more data is available, and the use of PERO when there isn't enough data for finetuning to generalize well. The relatively low performance of PERO with more data, especially for the NLI task, could be due to the much larger search space when using more training data. Since we keep the prompt size fixed to 10 examples, the search space is 10! for 10 training examples and 1000!/990! when using 1000 examples. While a better search strategy could potentially improve PERO's performance when using more data, we leave this as interesting future work. Note, however, that the search space complexity is determined by the number of training examples irrespective of their labels. For example, PERO improves over the baselines on the fact retrieval task (Figure 2), despite a much larger number of labels.
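The search-space sizes quoted above can be checked directly:

```python
import math

def search_space_size(n_train, k=10):
    """Number of ordered k-element sequences (without repetition) drawn
    from n_train training examples: n! / (n - k)!."""
    return math.perm(n_train, k)
```

With the prompt size fixed at k = 10, the space grows from 10! = 3,628,800 orderings for 10 training examples to roughly 10^30 ordered 10-subsets for 1000 examples.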
For reference, we provide the label accuracies when using all available training data, for PERO, Autoprompt and finetuning respectively: 95.0, 91.4 and 96.7 for sentiment classification, and 79.5, 87.3 and 99.1 for NLI. When compared to the traditional fully supervised finetuning approach, PERO performs within 94.3% while using only 0.015% of the training data for sentiment classification, and within 82.1% while using only 0.77% of the training data for NLI.

Ablation on PERO's components
In this section, we present an ablation study to understand the roles played by the two components of PERO, namely the genetic algorithm and separator token learning steps. We present the label accuracies for sentiment classification and NLI with and without the separator token learning step (indicated as PERO - sep learning) in Table 4. The results indicate that the permutation search using the genetic algorithm step provides large gains by itself, while the separator token learning potentially improves it further.

With the same search strategy as discussed in Section 4, we also search for potentially bad permutations, by inverting the definition of fitness. To focus on the role of permutations, we do not train the separator token for this experiment. We present the average test set accuracies across training splits for the best and the worst permutations in Table 5. Additionally, we also evaluate 100 random permutations for each training split. The mean (and standard deviation) test accuracy across training splits and random permutations was 85.6 (9.08) for sentiment classification and 67.9 (8.99) for NLI.
The results indicate that PERO's learned permutations provide significant gains over other permutations constructed using the same examples. Selecting the right permutation, therefore, is important for generalization.

How many examples does PERO need for good performance?
One could see a permutation learned by PERO as a combination of a label pattern and the training examples themselves. We present results when using only two training examples in Table 6. Remarkably, two examples alone, when selected well, can go a long way towards good performance. Additionally, using the learned label pattern provides at least a 10-point improvement in accuracy when compared with a sequence without repetitions (details omitted). This indicates a potential recipe for one-shot learning, which we discuss next.

Table 7: One-shot learning: Best and worst test set label accuracies with one-shot learning, using training example pairs obtained from the first 10 training examples. The best possible accuracies with the proposed one-shot learning approach are competitive with PERO using 10 examples, while improving over finetuning and Autoprompt using 10 examples. Please see Section 5.5.3 for details.

Can insights gained from this work lead to one-shot learning recipes?
To answer this question, we provide an example one-shot learning (one training example per class) algorithm which greedily grows a prompt sequence.
In contrast to Section 4, we do not use an additional validation set to select a good prompt sequence. We update the definition of fitness to prevent it from being biased towards one class, by defining it to be the minimum, and not the average, of the cross-entropy loss over the training examples. This is equivalent to minimizing the negative probability of the least probable target label.

Table 8: Example training pairs for one-shot learning corresponding to the best and worst test set accuracies for sentiment classification. Please see Section 5.5.3 for details.
SST-2 Best (Acc: 90.6): -ve sentiment: "on the worst revenge-of-the-nerds clichés the filmmakers could dredge up"; +ve sentiment: "demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop ."
SST-2 Worst (Acc: 56.2): -ve sentiment: "remains utterly satisfied to remain the same throughout"; +ve sentiment: "of saucy"
Following Section 5.5.2, we allow an example to be repeated in a prompt sequence. Setting the maximum possible length of the prompt sequence, i.e., the number of (potentially repeated) examples in the prompt sequence, to l_max, the algorithm is comprised of the following steps:

1. Initialize an empty prompt, c = ().

2. Create the set P of all possible prompts formed by inserting exactly one example into c. If we denote the length of c by l_c and the number of labels by N_labels, the size of this set is given by N_P = (l_c + 1) * N_labels.

3. Compute the fitness of the prompts in P.

4. Select the prompt c' in P with the best fitness.

5. Set c = c' and go to step 2 if l_c < l_max.
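The greedy procedure above can be sketched as follows; `score_fn` stands in for the fitness (computed with the pretrained LM, higher is better), and the label-balancing constraint used in our experiments is omitted for brevity.

```python
def grow_prompt(example_by_label, score_fn, l_max=10):
    """Greedily grow a prompt: at each step, try inserting each class's
    single training example at every position, and keep the candidate
    prompt with the best score. Examples may repeat (Section 5.5.2)."""
    c = []
    while len(c) < l_max:
        candidates = []
        for label, ex in example_by_label.items():
            for pos in range(len(c) + 1):
                candidates.append(c[:pos] + [(ex, label)] + c[pos:])
        # |candidates| == (len(c) + 1) * n_labels, as in step 2 above.
        c = max(candidates, key=score_fn)
    return c
```

With a toy score that rewards alternating labels, the greedy growth produces an alternating label pattern, mirroring the learned label patterns discussed in Section 5.5.2.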
We now discuss the results of using this one-shot learning approach on the tasks of sentiment classification and NLI. In each case, we consider the first 10 examples in the training set and create all possible training example pairs for one-shot learning, selecting one example from each class. This leads to 24 training example pairs in each case. We set the maximum length l_max to 10, and ensure that the prompt sequence is label-balanced at each step. We summarize the results in Table 7. The results indicate that the proposed algorithm is an effective approach to one-shot learning. In Table 8, we show the training examples corresponding to the best and worst cases for the task of sentiment classification. While there is some indication that more representative examples (such as longer examples) are more informative and thus more useful for one-shot learning, we leave a more thorough analysis as interesting future work.

Conclusion
In this paper, we propose PERO, a promising approach for few-shot learning, where we formulate learning as search over permutations of training examples, and optionally a separator token. We show the effectiveness of PERO for few-shot learning on the tasks of sentiment classification, NLI and fact retrieval. We demonstrate that PERO provides an interpretable and a more accurate way to probe the knowledge contained in pretrained language models. Our analysis of the learned prompts reveals novel insights and cues for further research on few-shot learning, including one-shot learning.

A.1.2 Finetuning Experiments
For the finetuning experiments, following the recommended settings for small datasets by Mosbach et al. (2020), we trained models for 20 epochs, using AdamW (Loshchilov and Hutter, 2018), with the learning rate linearly increasing to 2e-5 over the first 10% of epochs and then linearly decreasing to 0. The experiments were conducted on the same splits as PERO.

A.1.3 Training Time
Training time for PERO was approximately 3 hours for each experiment in the case of classification tasks, and approximately 30 minutes for each experiment of fact retrieval tasks.

A.1.4 Computing Infrastructure
We used Nvidia's GeForce GTX 1080 Ti GPUs for all our models. Each experiment was run on a single GPU.

A.1.5 Data
The experiments were done in the evaluation framework of Shin et al. (2020) who provide instructions for downloading the corresponding data splits at https://github.com/ucinlp/autoprompt.
Here, we provide more details on the classification datasets used. Details on the fact retrieval data are presented in Section A.2.2.
Sentiment Classification: We used the SST-2 dataset, the binarized version of the sentiment classification dataset created by Socher et al. (2013).
The training examples are constructed using movie review excerpts collected from the rottentomatoes.com website, with labels obtained using the Amazon Mechanical Turk crowdsourcing platform. The percentage of examples labeled with positive sentiment in the train, validation and test sets are 55.78%, 50.92% and 49.64% respectively. The number of examples labeled with positive sentiment in the training sets of size 10 used in this work are 4, 3, 7, 5 and 4. See Section 5.2 for selection and other details.

NLI:
We use the label-balanced 2-class NLI dataset created by Shin et al. (2020) using the SICK-E dataset (Marelli et al., 2014). The dataset was created using sentences from the 8K ImageFlickr data set and the SemEval 2012 STS

A.2.1 Sentiment Classification
With the experimental setup described in Section 5, we performed additional comparison between Autoprompt and PERO by creating 100 training splits of size 10, chosen successively from the first 1000 training examples in each dataset. We report on the average (and standard deviation) test accuracy with Autoprompt and PERO in Table 9.

A.2.2 Fact Retrieval
We present relation-wise training details and LAMA (Petroni et al., 2019) prompts which we used for our experiments, along with the detailed relation-wise test results, in Table 10.

A.3 Validation Set Results
In this section, we provide the validation set results omitted from the main text.