Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning

Recent prompt-based approaches allow pretrained language models to achieve strong performance on few-shot finetuning by reformulating downstream tasks as a language modeling problem. In this work, we demonstrate that, despite their advantages in low-data regimes, finetuned prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap, e.g., models incorrectly assuming a sentence pair has the same meaning because the two sentences consist of the same set of words. Interestingly, we find that this particular inference heuristic is significantly less present in the zero-shot evaluation of the prompt-based model, indicating how finetuning can be destructive to useful knowledge learned during pretraining. We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning. Our evaluation on three datasets demonstrates promising improvements on the three corresponding challenge datasets used to diagnose the inference heuristics.


Introduction
Prompt-based finetuning has emerged as a promising paradigm for adapting Pretrained Language Models (PLMs) to downstream tasks with a limited number of labeled examples (Schick and Schütze, 2021a; Radford et al., 2019). This approach reformulates downstream task instances as language modeling inputs, allowing PLMs to make non-trivial task-specific predictions even in zero-shot settings. This, in turn, provides a good initialization point for data-efficient finetuning (Gao et al., 2021), resulting in a strong advantage in low-data regimes where the standard finetuning paradigm struggles. However, the success of this prompting approach has only been shown using common held-out evaluations, which often conceal certain undesirable behaviors of models (Niven and Kao, 2019).
One such behavior commonly reported in downstream models is characterized by their preference to use surface features over general linguistic information (Warstadt et al., 2020). In the Natural Language Inference (NLI) task, McCoy et al. (2019) documented that models preferentially use the lexical overlap feature between sentence pairs to blindly predict that one sentence entails the other. Despite models' high in-distribution performance, they often fail on counterexamples of this inference heuristic, e.g., they predict that "the cat chased the mouse" entails "the mouse chased the cat".
At the same time, there is mounting evidence that pretraining on large text corpora extracts rich linguistic information (Hewitt and Manning, 2019; Tenney et al., 2019). However, recent studies show that standard finetuned models often overlook this information in the presence of lexical overlap (Nie et al., 2019; Dasgupta et al., 2018). We therefore ask whether direct adaptation of PLMs using prompts can better transfer the use of this information during finetuning. We investigate this question by systematically studying the heuristics in a prompt-based model finetuned across three datasets with varying data regimes. Our intriguing results reveal that: (i) zero-shot prompt-based models are more robust against the lexical overlap heuristic during inference, indicated by their high performance on the corresponding challenge datasets; (ii) however, prompt-based finetuned models quickly adopt this heuristic as they learn from more labeled data, indicated by gradual degradation of performance on the challenge datasets.
We then show that regularizing prompt-based finetuning, by penalizing updates that move the weights too far from their original pretrained values, is an effective approach to improve the in-distribution performance on target datasets while mitigating the adoption of inference heuristics. Overall, our work suggests that while prompt-based finetuning has gained impressive results on standard benchmarks, it can have a negative impact with regard to inference heuristics, which in turn suggests the importance of a more thorough evaluation setup to ensure meaningful progress.

Inference Heuristics in Prompt-based Finetuning
Prompt-based PLM Finetuning In this work, we focus on sentence pair classification tasks, where the goal is to predict the semantic relation y of an input pair x = (s1, s2). In a standard finetuning setting, s1 and s2 are concatenated along with a special [CLS] token, whose embedding is used as input to a newly initialized classifier head.
The prompt-based approach, on the other hand, reformulates the pair x as a masked language model input using a pre-defined template and word-to-label mapping. For instance, Schick and Schütze (2021a) formulate a natural language inference instance (s1, s2, y) as: [CLS] s1? [MASK], s2 [SEP] with the following mapping for the masked token: "yes" → "entailment", "maybe" → "neutral", and "no" → "contradiction". The probabilities assigned by the PLM to the label words at the [MASK] token can then be directly used to make task-specific predictions, allowing the PLM to perform in a zero-shot setting. Following Gao et al. (2021), we further finetune the prompt-based model on the available labeled examples for each task. Note that this procedure finetunes only the existing pretrained weights and does not introduce new parameters.
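As a concrete illustration, the reformulation can be sketched in a few lines of Python. This is a hypothetical helper, not the authors' code; the function names, string handling, and probability dictionary are our own illustrative choices:

```python
# Hypothetical sketch of the prompt reformulation (not the authors' code).
LABEL_WORDS = {"yes": "entailment", "maybe": "neutral", "no": "contradiction"}

def make_prompt(s1: str, s2: str) -> str:
    """Build the '[CLS] s1? [MASK], s2 [SEP]' masked-LM input."""
    s1 = s1.rstrip(".")            # drop the final period before the "?"
    s2 = s2[0].lower() + s2[1:]    # continue the sentence after the comma
    return f"[CLS] {s1}? [MASK], {s2} [SEP]"

def predict(label_word_probs: dict) -> str:
    """Map the most probable label word at [MASK] to a task label."""
    best_word = max(label_word_probs, key=label_word_probs.get)
    return LABEL_WORDS[best_word]

print(make_prompt("The actors that danced saw the author.",
                  "The actors saw the author."))
# -> [CLS] The actors that danced saw the author? [MASK], the actors saw the author. [SEP]
print(predict({"yes": 0.7, "maybe": 0.2, "no": 0.1}))  # -> entailment
```

In practice the masked-token probabilities would come from the PLM's language modeling head, restricted to the label words.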

Task and Datasets
We evaluate on three English-language datasets included in the GLUE benchmark (Wang et al., 2018) for which there are challenge datasets to evaluate the lexical overlap heuristic: MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), and Quora Question Pairs (QQP). In MNLI and SNLI, the task is to determine whether a premise sentence s1 entails, contradicts, or is neutral to the hypothesis sentence s2. In QQP, s1 and s2 are a pair of questions labeled as either duplicate or non-duplicate.

Table 1: Challenge dataset examples and their reformulation as prompts.

Original input:
  Premise: The actors that danced saw the author. / Hypothesis: The actors saw the author. / Label: entailment (support)
  Premise: The managers near the scientist resigned. / Hypothesis: The scientist resigned. / Label: non-entailment (against)

Reformulated input:
  The actors that danced saw the author? [MASK], the actors saw the author.
  The managers near the scientist resigned? [MASK], the scientist resigned. / Label word: No / Maybe

Researchers constructed corresponding challenge sets for the above datasets, which are designed to contain examples that go against the heuristics, i.e., the examples exhibit word overlap between the two input sentences but are labeled as non-entailment for NLI or non-duplicate for QQP. We evaluate each few-shot model against its corresponding challenge dataset. Namely, we evaluate models trained on MNLI against the entailment and non-entailment subsets of the HANS dataset (McCoy et al., 2019), which are further categorized into lexical overlap (lex.), subsequence (subseq.), and constituent (const.) subsets; SNLI models against the long and short subsets of the Scramble Test challenge set (Dasgupta et al., 2018); and QQP models against the PAWS dataset (Zhang et al., 2019). We illustrate challenge dataset examples and their reformulation as prompts in Table 1.
Model and Finetuning Our training and standard evaluation setup closely follow Gao et al. (2021), which measures finetuning performance across five different randomly sampled training sets of size K to account for finetuning instability on small datasets (Dodge et al., 2020; Mosbach et al., 2021). We perform five data subsamplings for each dataset and each data size K, where K ∈ {16, 32, 64, 128, 256, 512}. Note that K indicates the number of examples per label. We use the original development sets of each training dataset for testing the in-distribution performance. We perform all experiments using the RoBERTa-large model (Liu et al., 2019b).

[Figure 1: In-distribution (bold) vs. challenge datasets (italic) evaluation results of prompt-based finetuning across different data sizes K (x axis), where K = 0 indicates zero-shot evaluation. In all challenge sets, the overall zero-shot performance (both blue and green plots) degrades as the model is finetuned using more data.]
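The per-label subsampling described above (K examples per label, repeated over random seeds) might be sketched as follows. The helper name and toy data are our own illustrative choices, not the authors' code:

```python
import random

def subsample_per_label(dataset, k, seed):
    """Sample k examples per label (hypothetical helper mirroring the
    few-shot setup described above; not the authors' code)."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in dataset:
        by_label.setdefault(y, []).append((x, y))
    sample = []
    for y in sorted(by_label):           # stable label order for reproducibility
        sample.extend(rng.sample(by_label[y], k))
    return sample

# Toy 3-way "dataset": 100 examples per label.
data = [(f"pair-{i}", lab) for i in range(100) for lab in ("entail", "neutral", "contra")]
few_shot = subsample_per_label(data, k=16, seed=0)
print(len(few_shot))  # 48 = 16 examples x 3 labels
```

Running this with five different seeds yields the five training subsamples per data size.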
Inference heuristics across data regimes We show the results of prompt-based finetuning across different K in Figure 1. For the in-distribution evaluations (leftmost of each plot), the prompt-based models finetuned on MNLI, SNLI, and QQP improve rapidly with more training data before saturating at K = 512. In contrast to the in-distribution results, we observe a different performance trajectory on the three challenge datasets. On the Scramble and HANS sets, prompt-based models show non-trivial zero-shot performance (K = 0) that is above their in-distribution counterpart. However, as more data becomes available, the models exhibit a stronger indication of adopting the heuristics. Namely, the performance on the subset of examples that supports the heuristics increases, while the performance on cases that go against the heuristics decreases. This pattern is most pronounced on the lexical overlap subset of HANS, where the median accuracy on the non-entailment subset drops below 10% while the entailment performance reaches 100%. The results suggest that few-shot finetuning can be destructive to the initial ability of the prompt-based classifier to ignore surface features like lexical overlap. Finetuning appears to over-adjust model parameters to the small target data, which contain very few to no counterexamples to the heuristics (Min et al., 2020; Lovering et al., 2021).

Avoiding Inference Heuristics
Here we look to mitigate the adverse impact of finetuning by viewing the issue as an instance of catastrophic forgetting (French, 1999), which is characterized by the loss of performance on the original dataset after subsequent finetuning on new data. We then propose a regularized prompt-based finetuning based on the Elastic Weight Consolidation (EWC) method (Kirkpatrick et al., 2017), which penalizes updates on weights crucial for the original zero-shot performance. EWC identifies these weights using the empirical Fisher matrix (Martens, 2020), which requires samples of the original dataset. To avoid needing access to the pretraining data, we follow Chen et al. (2020), who assume stronger independence between the Fisher information and the corresponding weights. The penalty term is then akin to an L2 loss between the updated weights θ_i and the original weights θ*_i, resulting in the following overall loss:

L = α · L_FT + (1 − α) · λ · Σ_i (θ_i − θ*_i)²

where L_FT is a standard cross entropy loss, λ is a quadratic penalty coefficient, and α is a coefficient to linearly combine the two terms. We use the RecAdam implementation (Chen et al., 2020) for this loss, which also applies an annealing mechanism to gradually upweight the standard loss L_FT toward the end of training.

Baselines We compare regularized finetuning with another method that also minimally updates the pretrained weights: simple weight fixing of the first n layers of the pretrained model, where the n layers are frozen (including the token embeddings) and only the weights of the upper layers and the LM head are updated throughout finetuning. In the evaluation, we use n ∈ {6, 12, 18}. We refer to these baselines as FT-fixn.
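A minimal sketch of this combined loss, assuming flat lists of weights and our own function name (not the authors' RecAdam-based implementation, which operates inside the optimizer update):

```python
# Our reading of the regularized objective (not the authors' RecAdam code):
#   L = alpha * L_FT + (1 - alpha) * lam * sum_i (theta_i - theta_star_i) ** 2

def regularized_loss(ce_loss, weights, pretrained_weights, lam=5000.0, alpha=0.5):
    penalty = sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained_weights))
    return alpha * ce_loss + (1.0 - alpha) * lam * penalty

theta_star = [0.10, -0.20, 0.05]   # pretrained weights
theta_near = [0.10, -0.20, 0.05]   # no drift -> only the (scaled) task loss remains
theta_far  = [0.50, -0.20, 0.05]   # first weight drifted by 0.4

print(regularized_loss(1.0, theta_near, theta_star))           # 0.5
print(round(regularized_loss(1.0, theta_far, theta_star), 2))  # 400.5
```

The large penalty for even a single drifted weight illustrates why λ = 5000 keeps the model close to its pretrained solution unless the task loss strongly favors moving.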

Results
We evaluate all the considered finetuning strategies by taking their median performance after finetuning on 512 examples (for each label) and comparing them with the original zero-shot performance. We report the results in Table 2, which also includes the results of standard classifier head finetuning (last row). We observe the following: (1) Freezing layers has mixed challenge-set results, e.g., FT-fix18 improves over vanilla prompt-based finetuning on HANS and PAWS, but degrades Scramble and all in-distribution performances; (2) The L2 regularization strategy, rFT, achieves consistent improvements on the challenge sets while only costing a small drop in the corresponding in-distribution performance, e.g., +6pp, +8pp, and +5pp on HANS, PAWS, and Scramble, respectively; (3) Although vanilla prompt-based finetuning performs relatively poorly, it still has an advantage over standard classifier head finetuning by +2.5pp, +2.0pp, and +1.0pp on the average scores of each in-distribution and challenge dataset pair. Additionally, Figure 2 shows rFT's improvement over vanilla prompt-based finetuning across data regimes on MNLI and HANS. We observe that the advantage of rFT is strongest on the lexical overlap subset, which initially shows the highest zero-shot performance. The results also suggest that the benefit of rFT peaks at mid data regimes (e.g., K = 32), before saturating when the training data size is increased further. We also note that our results are consistent when we evaluate alternative prompt templates or finetune for a varying number of epochs. The latter indicates that the adoption of inference heuristics is more likely attributed to the amount of training examples than to the number of learning steps.

Related Work
Inference Heuristics Our work relates to a large body of literature on the problem of "bias" in training datasets and its ramifications for the resulting models across various language understanding tasks (Niven and Kao, 2019; Poliak et al., 2018; Tsuchiya, 2018; Gururangan et al., 2020). Previous work shows that artifacts of data annotation result in spurious surface cues which give away the labels, allowing models to perform well without properly learning the intended task. For instance, models are shown to adopt heuristics based on the presence of certain indicative words or phrases in tasks such as reading comprehension. Although the problem has been extensively studied, most works focus on models that are trained in standard settings where larger training datasets are available. Our work provides new insights into inference heuristics in models that are trained in zero- and few-shot settings.

Other work (2020) shows that methods improving compute and memory efficiency using pruning and quantization may be at odds with robustness and fairness. They report that while performance on standard test sets is largely unchanged, the performance of efficient models on certain underrepresented subsets of the data is disproportionately reduced, suggesting the importance of a more comprehensive evaluation to estimate overall changes in performance.

Conclusion
Our experiments shed light on the negative impact of low-resource finetuning on models' overall performance, which was previously obscured by the standard evaluation setup. The results indicate that while finetuning helps prompt-based models rapidly gain in-distribution improvements as more labeled data becomes available, it also gradually increases the models' reliance on surface heuristics, which we show to be less present in zero-shot evaluation. We further demonstrate that applying a regularization that preserves pretrained weights during finetuning mitigates the adoption of heuristics while also maintaining high in-distribution performance.

[Table caption: The last 3 rows are automatically generated templates and label words that are shown by Gao et al. (2021) to further improve few-shot finetuning. Note that we use the corresponding task's template when evaluating on the challenge datasets.]

Challenge datasets
We provide examples from each challenge dataset considered in our evaluation to illustrate sentence pairs that support or go against the heuristics. Table 4 shows examples for HANS, PAWS, and Scramble Test. Following McCoy et al. (2019), we obtain the probability for the non-entailment label by summing the probabilities assigned by models trained on MNLI to the neutral and contradiction labels. We use the same-type subset of the Scramble Test (Dasgupta et al., 2018), which contains examples of both entailment (support) and contradiction (against) relations.
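The label-collapsing step above can be sketched as follows (our own hypothetical helper, not the authors' evaluation code):

```python
def to_binary(probs):
    """Collapse 3-way NLI probabilities into HANS' binary labels by summing
    neutral and contradiction into non-entailment (hypothetical helper)."""
    return {
        "entailment": probs["entailment"],
        "non-entailment": probs["neutral"] + probs["contradiction"],
    }

print(to_binary({"entailment": 0.5, "neutral": 0.25, "contradiction": 0.25}))
# -> {'entailment': 0.5, 'non-entailment': 0.5}
```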
HANS details The HANS dataset is designed based on the insight that word overlap between premise and hypothesis in NLI datasets is spuriously correlated with the entailment label. HANS consists of examples in which relying on this correlation leads to an incorrect label, i.e., hypotheses are not entailed by their word-overlapping premises. HANS is split into three test cases: (a) Lexical overlap (e.g., "The doctor was paid by the actor" → "The doctor paid the actor"), (b) Subsequence (e.g., "The doctor near the actor danced" → "The actor danced"), and (c) Constituent (e.g., "If the artist slept, the actor ran" → "The artist slept"). Each subset contains both entailment and non-entailment examples that always exhibit word overlap.

[Table 4 examples. HANS (McCoy et al., 2019): premise "The artists avoided the senators that thanked the tourists.", hypothesis "The artists avoided the senators.", label entailment (support). Premise "The woman is more cheerful than the man.", hypothesis "The woman is more cheerful than the man.", label entailment (support); premise "The woman is more cheerful than the man.", hypothesis "The man is more cheerful than the woman.", label contradiction (against).]
Hyperparameters Following Schick and Schütze (2021b,a), we use a fixed set of hyperparameters for all finetuning: a learning rate of 1e-5, a batch size of 8, and a maximum sequence length of 256.
Regularization implementation We use the RecAdam implementation by Chen et al. (2020) with the following hyperparameters. We set the quadratic penalty λ to 5000, and the linear combination factor α is set dynamically throughout training according to a sigmoid schedule, where α at step t is defined as:

s(t) = 1 / (1 + exp(−k (t − t0)))

where the parameter k regulates the rate of the sigmoid, and t0 sets the point where s(t) goes above 0.5. We set k to 0.01 and t0 to 0.6 of the total training steps.
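A sketch of this annealing schedule under the stated hyperparameters (k = 0.01, t0 at 0.6 of the total steps); the function name is our own, not RecAdam's API:

```python
import math

def anneal_alpha(t, total_steps, k=0.01, t0_frac=0.6):
    """Sigmoid annealing for alpha: crosses 0.5 at t0 = t0_frac * total_steps,
    gradually upweighting the task loss L_FT toward the end of training."""
    t0 = t0_frac * total_steps
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

total = 10_000
print(round(anneal_alpha(0, total), 4))       # ~0 early: training is dominated by the L2 penalty
print(anneal_alpha(6_000, total))             # 0.5 exactly at t0
print(round(anneal_alpha(10_000, total), 4))  # ~1 by the end: mostly the task loss
```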

B Additional Results
Standard CLS finetuning Previously, Gao et al. (2021) reported that the performance of standard non-prompt finetuning with an additional classifier head (CLS) can converge to that of its prompt-based counterpart after a certain amount of data, e.g., 512 examples. It is then interesting to compare both finetuning paradigms in terms of their heuristics-related behavior. Figure 4 shows the results of finetuning using a standard classifier head across varying data regimes on MNLI and the 3 subsets of HANS. We observe high instability of the results when only a small amount of data is available (e.g., K = 64). The learning trajectories are consistent across the HANS subsets, i.e., they start with random predictions in lower data regimes and improve as more data becomes available. We observe that standard prompt-based finetuning still performs better than CLS finetuning, indicating that the prompt-based approach provides a good initialization to mitigate heuristics, and that employing regularization during finetuning can further improve the challenge dataset (out-of-distribution) performance.
Impact of prompt templates A growing number of works propose varying prompt generation strategies to push the benefits of prompt-based predictions further (Gao et al., 2021; Schick et al., 2020). We therefore question whether different choices of templates would affect the model's behavior related to lexical overlap. We evaluate the 3 top-performing templates for MNLI that were obtained automatically by Gao et al. (2021) and show the results in Table 5. We observe similar behavior from the resulting models compared to the manual prompt counterpart, achieving HANS average accuracy of around 62% on zero-shot evaluation and below 55% after finetuning with 512 examples.

Impact of learning steps
We investigate the degradation of challenge dataset performance as a function of the amount of training data available during finetuning. However, adding more training examples while fixing the number of epochs introduces a confounding factor to our finding: the number of learning steps applied to the model's weights. To factor out the number of steps, we perform a similar evaluation with a fixed amount of training data and a varying number of training epochs.
On 32 examples per label, we finetune for 10, 20, 30, 40, and 50 epochs. Additionally, we finetune on 512 examples for 1 to 10 epochs to see if the difference in learning steps results in different behavior. We plot the results in Figure 3. We observe that both finetuning settings result in similar trajectories, i.e., models start to adopt heuristics immediately in early epochs and later stagnate even with an increasing number of learning steps. For instance, finetuning on 32 examples for the same number of training steps as finetuning on 512 examples for 1 epoch still results in higher overall HANS performance. We conclude that the amount of finetuning data plays a more significant role than the number of training steps. Intuitively, larger training data is more likely to contain more examples that disproportionately support the heuristics; e.g., NLI pairs with lexical overlap are rarely of the non-entailment relation (McCoy et al., 2019).
Regularization across data regimes Figure 5 shows the improvement of L2 weight regularization over vanilla prompt-based finetuning on QQP and SNLI. Similar to the results on MNLI/HANS, the improvements are highest in mid data regimes, e.g., 32 examples per label.

Impact of pretrained model In addition to evaluating RoBERTa-large, we also evaluate other commonly used transformer-based pretrained language models: RoBERTa-base, BERT-base-uncased, and BERT-large-uncased. The results are shown in Table 6. We observe a similar pattern across PLMs, i.e., improved in-distribution scores come at the cost of degradation on the corresponding challenge datasets.