Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, discriminative pre-trained models like ELECTRA, a strong alternative in full-shot settings, do not fit into this paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models on a wide range of tasks. ELECTRA is pre-trained to distinguish whether a token is generated or original. We naturally extend this objective to prompt-based few-shot learning by training the model to score the originality of the target options, without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.


Introduction
Large pre-trained language models are known to be effective zero-shot and few-shot learners when scaled (Brown et al., 2020; Artetxe et al., 2021; Rae et al., 2021). Much smaller masked language models (MLMs), like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), can be fine-tuned with only a few examples by utilizing prompt-based fine-tuning, which updates the model to select the correct target word or option (Schick and Schütze, 2021a; Gao et al., 2021).
In this paper, we hypothesize that discriminative pre-trained models like ELECTRA (Clark et al., 2020) will make even stronger few-shot learners than MLMs, as they are pre-trained to distinguish between challenging alternatives. To test this hypothesis, we explore prompt-based learning with ELECTRA by aligning its pre-training

[1] Code is available at https://github.com/facebookresearch/ELECTRA-Fewshot-Learning.

Figure 1: An illustration of prompting MLMs and ELECTRA with single-token options on SST-2 (Socher et al., 2013) and multi-token options on COPA (Roemmele et al., 2011). The underlined text is the task-specific template. c(·): contextualized embedding; y and y′: a correct and an incorrect option, respectively.
objective (distinguishing whether a single token is generated or original) with prompt-based predictions for downstream tasks. We reuse ELECTRA's discriminative head to classify the correct target word as an original token. As an additional benefit, we can naturally adapt the approach to multi-token spans by aggregating either hidden representations or output probabilities. In contrast, MLMs require autoregressive decoding to handle multi-token options (Schick and Schütze, 2021b).
We propose an approach to prompting ELECTRA, as shown in Figure 1. Though trained with the same or even less computation than BERT and RoBERTa, ELECTRA turns out to be a more effective few-shot learner. It outperforms BERT and RoBERTa by 10.2 and 3.1 points on average across 9 tasks with single-token options for base-sized models in the few-shot setting, and the trend prevails for large-sized models. ELECTRA also outperforms RoBERTa on 4 tasks with multi-token options. Our analysis suggests that failing predictions from ELECTRA's generator can feed the discriminator negatives with meanings opposite to the correct tokens, which strengthens ELECTRA's ability to distinguish concepts with opposite meanings in zero-shot predictions.

Prompting Masked Language Models
MLMs such as BERT and RoBERTa are trained by masking words in inputs and maximizing the probability of the original tokens at the positions replaced by [MASK]. Given a sequence x_1, x_2, ..., x_n with the i-th token masked, the objective is to maximize

$$p(x_i \mid x_{\setminus i}) = \frac{\exp\big(e_{x_i}^{\top} c(x_i)\big)}{\sum_{v \in V} \exp\big(e_v^{\top} c(x_i)\big)},$$

where e_v denotes the embedding of the word v ∈ V.
We use c(·) to denote the contextualized representation for simplicity. Prompt-based learning turns the objective into a softmax distribution over the target words of a prompt template (Schick and Schütze, 2021a; Gao et al., 2021). For example, in binary sentiment analysis, given an input sentence x, its associated label y ∈ {positive, negative}, and a template T, we formulate the prompt as:

T(x) = x It was [MASK].

By defining a mapping M : Y → V from the task label space to words in the vocabulary, the task is transformed into predicting the target word M(y):

$$p(y \mid x) = \frac{\exp\big(e_{M(y)}^{\top} c([\mathrm{MASK}])\big)}{\sum_{y' \in Y} \exp\big(e_{M(y')}^{\top} c([\mathrm{MASK}])\big)}.$$
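The MLM scoring step above can be sketched as a softmax over the [MASK]-position logits of the verbalizer words. The logits below are hypothetical stand-ins for a real forward pass, and the verbalizer dictionary plays the role of the mapping M : Y → V.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_prompt_predict(mask_logits, verbalizer):
    """Score each label by the [MASK]-position logit of its verbalizer word.

    mask_logits: dict word -> logit at the [MASK] position (here supplied
    directly; in practice these come from an MLM forward pass).
    verbalizer: dict label -> word, i.e. the mapping M: Y -> V.
    """
    labels = list(verbalizer)
    probs = softmax([mask_logits[verbalizer[y]] for y in labels])
    return dict(zip(labels, probs))

# Hypothetical logits for "It is pretty damned funny. It was [MASK]."
logits = {"great": 3.2, "terrible": -1.1}
preds = mlm_prompt_predict(logits, {"positive": "great", "negative": "terrible"})
best = max(preds, key=preds.get)
```

Note that only the verbalizer words enter the softmax, which is the standard simplification in prompt-based fine-tuning.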
This formulation can be used for prompt-based zero-shot evaluation and few-shot fine-tuning to perform gradient updates.For tasks involving multi-token options, such as multiple-choice tasks, prompt-based fine-tuning with MLMs is less intuitive.For example, Schick and Schütze (2021b) adopt a multi-class hinge loss for training and devise a heuristic decoding method to estimate probabilities for target options during inference.The disadvantages are (1) such usage of MLMs deviates from the pre-training objective; (2) the pseudoautoregressive decoding approach cannot forward in batches during inference, which is computationally inefficient.

Discriminative Pre-trained Models
Discriminative pre-trained models such as ELECTRA (Clark et al., 2020) cast the word prediction problem as a binary classification problem. In ELECTRA, a discriminator and a smaller generator are jointly trained, with the discriminator's goal being to distinguish whether tokens are sampled from the generator or come from the original data:

$$\mathcal{L} = \sum_i -\mathbb{1}(x'_i = x_i)\,\log H\big(c(x'_i)\big) - \mathbb{1}(x'_i \neq x_i)\,\log\big(1 - H(c(x'_i))\big),$$

where {x_i} are tokens from the original sentence, {x′_i} are tokens from the corrupted sentence, and H denotes the discriminator head. We refer readers to Clark et al. (2020) for more details.
Method: Prompting ELECTRA

Discriminative models like ELECTRA are strong alternatives to MLMs, so they have the potential to be effective few-shot learners even though they do not fit the current paradigm. Furthermore, ELECTRA could be more amenable to solving tasks involving multi-token options by reusing its discriminative head. In this section, we propose adapting ELECTRA to accommodate a wide range of tasks involving single-token or multi-token options for prompt-based learning.[2]

Tasks with Single-token Target Words
The prompts for ELECTRA models are formulated from an input sentence x, a label y ∈ Y, and a template T with the mapping function M. An example for sentiment classification is:

T(x, y) = x It was M(y).

For each input sentence, we create |Y| prompts and forward them for gradient updates such that the model predicts the correct target word as an original token and the incorrect ones as generated tokens:

$$\mathcal{L} = -\log H\big(c(M(y))\big) - \sum_{y' \neq y} \log\big(1 - H(c(M(y')))\big).$$

During inference, the model predicts how likely each target option is to fit into the sentence and outputs the most likely one. This approach allows us to perform prompt-based zero-shot prediction and few-shot fine-tuning analogously to the MLM paradigm.[3] Note that this approach requires forwarding the input with different target words |Y| times, which is less efficient than MLMs.
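The inference procedure above can be sketched as follows: build one prompt per label, score the originality of each filled-in target word, and output the label with the highest score. The scorer below is a toy stand-in for a forward pass through ELECTRA's discriminator, and the template string is illustrative.

```python
def electra_prompt_predict(originality, template, x, verbalizer):
    """Pick the label whose verbalizer word the discriminator scores
    as most 'original'.

    originality: callable prompt -> probability that the target token is
    original (a stand-in for ELECTRA's discriminator forward pass).
    Requires one forward pass per option, i.e. |Y| passes per example.
    """
    scores = {}
    for label, word in verbalizer.items():
        prompt = template.format(x=x, w=word)
        scores[label] = originality(prompt)
    return max(scores, key=scores.get), scores

# Toy scorer: rates "great" as original after a positive-sounding sentence.
def toy_scorer(prompt):
    return 0.9 if "funny" in prompt and "great" in prompt else 0.2

label, scores = electra_prompt_predict(
    toy_scorer, "{x} It was {w}.", "It is pretty damned funny.",
    {"positive": "great", "negative": "terrible"})
```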

Tasks with Multi-token Target Options
We can readily adapt ELECTRA's discriminative objective to accommodate tasks with multi-token options for prompt-based fine-tuning. The mapping M : Y → V* is an identity function for tasks where the target spans are the options themselves. Consider the multiple-choice task COPA (Roemmele et al., 2011); given a premise x, a template T, and an option y ∈ Y, we formulate the prompt as: T(x, y) = x so/because M(y).
[3] We also experimented with a variation that adapts the discriminative objective for contrastive learning, but the results were not as competitive. See Appendix F for details.
As an option M(y) contains multiple tokens, we either average the hidden representations of all tokens in M(y) (equivalently, y):

$$H\Big(\frac{1}{|y|}\sum_j c(y_j)\Big),$$

where y_j denotes the j-th token of an option y; or use the average probability of all tokens in y as the final prediction:

$$\frac{1}{|y|}\sum_j H\big(c(y_j)\big);$$

or simply take the [CLS] token's probability: H(c([CLS])). These approaches fully reuse the pre-trained weights of ELECTRA, including the discriminator head, and avoid autoregressive-style decoding. Similar to PET, we only use them for few-shot fine-tuning due to the discrepancy from pre-training.
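The two aggregation strategies above differ only in whether the discriminator head is applied before or after averaging. A minimal sketch, using a toy linear-plus-sigmoid head in place of ELECTRA's pre-trained one:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_avg_prob(token_logits):
    """Head per token, then average: (1/|y|) sum_j H(c(y_j))."""
    return sum(sigmoid(z) for z in token_logits) / len(token_logits)

def score_avg_rep(token_reps, weight, bias):
    """Average representations first, then one head application:
    H((1/|y|) sum_j c(y_j)). The linear head here is a stand-in for
    ELECTRA's discriminator head."""
    dim = len(token_reps[0])
    mean = [sum(r[d] for r in token_reps) / len(token_reps)
            for d in range(dim)]
    return sigmoid(sum(w * m for w, m in zip(weight, mean)) + bias)

s1 = score_avg_prob([1.2, 0.4, 2.0])
s2 = score_avg_rep([[0.5, 1.0], [1.5, 0.0]], weight=[1.0, 1.0], bias=0.0)
```

At inference, each option's span is scored this way and the highest-scoring option is returned, so no autoregressive decoding is needed.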
Experimental Results

Setup
We run experiments with released checkpoints of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020). For tasks with single-token target words, we conduct prompt-based zero-shot evaluations as well as standard[5] and prompt-based few-shot training for each checkpoint. We evaluate on 9 tasks: SST-2, SST-5, MR, MNLI, RTE, QNLI, SNLI, AGNews, and BoolQ.[6] For tasks with multi-token options, we evaluate the few-shot setting on COPA, StoryCloze, HellaSwag, and PIQA. Details of the datasets (including references) and prompts are in Appendix B and Appendix H. For our default experiments, we use 16 examples per label for single-token tasks and 32 examples for multiple-choice tasks. Following Gao et al. (2021), we create a development set of the same size as the training set for model selection, and we conduct three runs for all experiments to mitigate instability issues (Dodge et al., 2020).[7]

Tasks with Single-token Target Words
Table 1 reports zero-shot and few-shot fine-tuning results on base-sized models.[8] ELECTRA shows a clear advantage over BERT and RoBERTa, with average margins of 7.9 and 3.5 points on zero-shot prediction, respectively, and average margins of 10.2 and 3.1 on prompt-based few-shot fine-tuning. The difference is much smaller with standard few-shot fine-tuning (3.1 and 1.1, respectively),[9] suggesting that ELECTRA is inherently better at prompt-based learning, in addition to being a better model in general. On that note, we find that prompt-based fine-tuning consistently outperforms standard fine-tuning, in line with prior work (Gao et al., 2021; Schick and Schütze, 2021b), which reinforces the importance of using prompts in the few-shot learning setting.

[5] We use the [CLS] token for prediction in standard fine-tuning, known as head fine-tuning in Le Scao and Rush (2021).
[6] BoolQ is licensed under CC-BY-SA 3.0.
[7] More training details are in Appendix C.
[8] Results on large-sized models are in Appendix D.

Tasks with Multi-token Target Options
For tasks involving multi-token options, we focus on the few-shot fine-tuning setting and use task-specific templates to encode data in all experiments.
For both models, we experiment with the few-shot fine-tuning setting where we map the [CLS] representations to scalars: for RoBERTa, we train a head from scratch, while for ELECTRA, we reuse the discriminator head. Additionally, we test the PET approach (Schick and Schütze, 2021b) on RoBERTa models as illustrated in Figure 1.
As shown in Table 2, ELECTRA generally delivers better and more stable performance than RoBERTa. PET (Schick and Schütze, 2021b), which uses a heuristic autoregressive decoding approach, outperforms RoBERTa with [CLS] fine-tuning in most cases but still falls behind ELECTRA. For ELECTRA, using average token representations is comparable to or outperforms [CLS] representations for prediction on the base-sized model, but [CLS] fine-tuning leads to the best performance on the large-sized model.
These results demonstrate the potential of discriminative models on a broader range of tasks under the few-shot setting.[10]

Analysis

Number of Examples
Figure 3 shows the standard and prompt-based few-shot fine-tuning performance as the number of instances (K) increases for RoBERTa and ELECTRA on four datasets.[11] ELECTRA outperforms RoBERTa with small K, and the two converge when K ≥ 256. The performance gap widens as the number of examples decreases, demonstrating that ELECTRA's discriminative pre-training objective is well-suited for few-shot applications.

Prediction Analysis
Figure 2 presents the output distributions of zero-shot predictions of RoBERTa and ELECTRA on SST-2.[12] We normalize the RoBERTa output across the target words (great, terrible) and keep the ELECTRA output as is. For negative examples, the predictions from RoBERTa are only slightly skewed towards terrible, indicating that RoBERTa likely assigns a similar probability to the antonym great when the word terrible is masked. This finding sheds light on why ELECTRA outperforms RoBERTa: it has likely seen these closely-related alternative words during training and learned to suppress the probability of such words being original.

To verify that this analysis does not spuriously correlate with the task template, we analyze RoBERTa's output distribution on its pre-training corpus. We randomly sample sentences that contain either the word great or terrible and forward them through the model after masking these two words. We visualize the normalized output distribution over great and terrible in Figure 2 and observe a pattern similar to RoBERTa's zero-shot prediction distribution on SST-2. This corroborates our hypothesis that masked language models sometimes fail to predict the correct word and instead output its antonym, e.g., when the ground truth is terrible, which enables ELECTRA to distinguish semantically opposite words and further strengthens its prompt-based prediction ability.

Conclusion
We explore discriminative pre-trained models for prompt-based zero-shot and few-shot learning. We find that these models consistently outperform masked language models trained with equivalent or even less computation, suggesting that discriminative pre-trained models are more effective zero-shot and few-shot learners. Analysis shows that ELECTRA's generator very likely feeds negatives such as antonyms to the discriminator, which serve as direct contrasts during pre-training. We also speculate that discriminative models are less vulnerable to surface form competition (Holtzman et al., 2021), and we would like to dig deeper into this hypothesis in future work.

Limitations
One limitation of this work is that we restrict our exploration to the scope of discriminative tasks. It is prohibitively expensive to apply our prompting approach for ELECTRA to tasks without a limited set of candidates: the approach requires one forward pass per option for each example, whereas masked language models require only one forward pass per example.
Another limitation is that we only include a limited set of continuation-based multiple-choice tasks for evaluation due to space constraints.We leave evaluating on a more diverse set of multiple-option tasks as future work.

A Model Details
We list the details of the pre-trained models, including training corpora, vocabulary size, training steps, and GLUE development set results, in Table 3. ELECTRA, which is trained on the same set of corpora as BERT, outperforms BERT on GLUE datasets by 3 to 5 points. It slightly underperforms RoBERTa at the base size but is comparable to RoBERTa at the large size.

C Training Details
Following Gao et al. (2021), we conduct a grid search for all few-shot experiments, taking learning rates from {1e-5, 2e-5, 3e-5} and batch sizes from {2, 4, 8}. For each trial, we perform gradient updates for 1000 steps, evaluate the model every 100 steps, and select the model with the best validation accuracy. For full-shot experiments, we conduct a grid search with learning rates from {1e-5, 2e-5, 3e-5} and use a batch size of 16.
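The grid search above can be sketched as follows. The `train_eval` callable is a hypothetical stand-in for one fine-tuning trial (1000 steps with evaluation every 100 steps) that returns the best validation accuracy seen during that trial.

```python
from itertools import product

def grid_search(train_eval, lrs=(1e-5, 2e-5, 3e-5), batch_sizes=(2, 4, 8)):
    """Return the (learning rate, batch size) pair with the best
    validation accuracy.

    train_eval: callable (lr, bs) -> validation accuracy; a stand-in for
    an actual fine-tuning run with periodic evaluation.
    """
    best_cfg, best_acc = None, -1.0
    for lr, bs in product(lrs, batch_sizes):
        acc = train_eval(lr, bs)
        if acc > best_acc:
            best_cfg, best_acc = (lr, bs), acc
    return best_cfg, best_acc

# Toy stand-in: pretend (2e-5, 4) is the best configuration.
cfg, acc = grid_search(lambda lr, bs: 0.8 if (lr, bs) == (2e-5, 4) else 0.6)
```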

D Results on Large-sized Models
We present prompt-based zero-shot and few-shot results on large-sized models in Table 4 to show that the trend prevails when the model scales up. Except on SNLI, the average gain from prompt-based fine-tuning for ELECTRA is significantly larger than for BERT and RoBERTa. Notably, ELECTRA also significantly outperforms BERT and RoBERTa on zero-shot prediction.

E Number of Examples
We show the few-shot results as a function of K on BoolQ, RTE, AGNews, and MR in Figure 4. ELECTRA significantly outperforms RoBERTa on BoolQ and RTE across all settings, suggesting that ELECTRA is an overall stronger model on these datasets. On MR, we observe a similar pattern where the gap between ELECTRA and RoBERTa narrows as K grows, showing that ELECTRA benefits from prompt-based training more than RoBERTa. On AGNews, ELECTRA underperforms RoBERTa on standard fine-tuning but closes the gap with prompt-based fine-tuning, backing up the argument that ELECTRA benefits more from prompts.

F An Alternative Contrastive Objective
We also explored another contrastive objective based on ELECTRA's logits for prompt-based few-shot fine-tuning. For all the prompts of an input x with label set Y, we define the loss as

$$\mathcal{L} = -\log \frac{\exp\big(\phi(T(x, y))\big)}{\sum_{y' \in Y} \exp\big(\phi(T(x, y'))\big)},$$

where H(x) = 1/(1 + e^{-φ(x)}) and φ(x) denotes the logit from the discriminator. This objective directly contrasts the correct target option with the incorrect ones. We show results on SST-2 and AGNews in Table 5. Prompt-based fine-tuning with the original ELECTRA objective outperforms the contrastive objective. We hypothesize that the downside of the contrastive objective is that it forces each input with its different target options to be packed into the same batch instead of shuffling the whole dataset randomly, which affects optimization. To verify this hypothesis, we also train with the original discriminative objective under the same batching restriction and observe a performance drop.
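The contrastive loss above is a softmax cross-entropy over the discriminator logits of all options for one input. A minimal sketch, with the per-option logits φ(T(x, y)) supplied directly as hypothetical values:

```python
import math

def contrastive_loss(option_logits, correct):
    """Softmax cross-entropy over discriminator logits of all options:
    -log exp(phi_y) / sum_{y'} exp(phi_{y'}).

    option_logits: dict label -> discriminator logit phi(T(x, y)).
    correct: the ground-truth label y.
    """
    m = max(option_logits.values())  # shift for numerical stability
    log_denom = math.log(sum(math.exp(z - m) for z in option_logits.values()))
    return -(option_logits[correct] - m) + log_denom

# Hypothetical logits for the two SST-2 options of one input.
loss = contrastive_loss({"great": 2.0, "terrible": -1.0}, "great")
```

Note that computing this loss requires all options of one input in the same forward batch, which is exactly the data-loading restriction discussed above.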

G Few-shot Output Distribution
We show the few-shot output distributions of RoBERTa and ELECTRA on SST-2 in Figure 5. The output distributions are polarized after few-shot training.

H Prompts
We largely follow previous work to construct our prompts. For sentiment classification and natural language inference tasks, we use prompts from Gao et al. (2021). For AGNews, we use the prompt from Holtzman et al. (2021), and for BoolQ, we use the prompt from Schick and Schütze (2021b). For tasks involving multi-token options, we simply concatenate the context and options, largely following Holtzman et al. (2021). The prompt details can be found in Table 6 and Table 7.
To verify that the prompts do not affect our major conclusion, we conduct prompt-based few-shot fine-tuning experiments with different prompts for four tasks. The prompts we use are in Table 8. Results in Table 9 show that ELECTRA outperforms RoBERTa with different prompts.
(a) Prompting RoBERTa on SST-2 with single-token options. (b) Prompting ELECTRA on SST-2 with single-token options.

Figure 2: Zero-shot prediction distributions on SST-2 with RoBERTa (left) and ELECTRA (middle), and zero-shot prediction distributions on pre-training data that contain the target words (right). Each sub-graph shows the output distribution for inputs associated with a label y ∈ {negative, positive} when prompted with the target words {great, terrible}. The y-axis shows the percentage of values in each sub-graph. For RoBERTa, the values are normalized across target words, while for ELECTRA, the scores are the raw outputs from its discriminator.

Figure 3: Few-shot performance of RoBERTa vs. ELECTRA with standard and prompt-based fine-tuning as K (the number of instances per label) increases on four datasets.

Figure 4: Few-shot performance of RoBERTa vs. ELECTRA with standard and prompt-based fine-tuning as K (the number of instances per label) increases on more tasks.

Figure 5: Few-shot prediction distributions on SST-2 with RoBERTa-base and ELECTRA-base. Each sub-graph shows the output distribution for inputs with a label y ∈ {negative, positive} when prompted with the corresponding target option M(y).

Table 2: Multiple-choice task results for prompt-based fine-tuning on RoBERTa and ELECTRA with 32 examples across three runs. CLS, prob, and rep denote that we take the [CLS] representation, the average probability, or the average representations for prediction, respectively.

Table 4: Zero-shot, few-shot (16 examples per label), and full-shot results of large-sized BERT, RoBERTa, and ELECTRA. ✓ denotes whether prompts are used.

Table 5: Few-shot prompt-based fine-tuning results with different objectives on ELECTRA-base. "Original w/o shuffling" denotes that we load the batches without data shuffling to mimic the data-loading restriction of training with the contrastive objective.

Table 6: Task templates for tasks with single-token verbalizers.

Table 7: Task templates for tasks with multi-token verbalizers.

Table 8: Task templates for the template sensitivity test.

Table 9: Few-shot results with different templates on base-sized models. ELECTRA still outperforms RoBERTa with different templates (provided in Table 8).