Don’t Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.


Introduction
Recent work has obtained strong zero-shot results by prompting language models (Brown et al., 2020; Chowdhery et al., 2022). As formalized by Schick and Schütze (2021a), the core idea is to reformulate text classification as language modeling using a pattern and a verbalizer. Given the input space X, the output space C and the space of possible strings V*, the pattern t : X → V* maps each input into a string with a masked span, whereas the verbalizer v : C → V* maps each label into a string. A language model can then be used for zero-shot classification by picking the most likely completion for the masked text, arg max_{c∈C} p(v(c) | t(x)). In few-shot settings, better results can be obtained by prepending a few labeled examples (Brown et al., 2020), or using them in some form of fine-tuning (Schick and Schütze, 2021a; Gao et al., 2021).

Figure 1: Proposed method. 1) We mine labeled examples from a text corpus with regex-based patterns. 2) Optionally, we filter examples for which zero-shot prompting predicts a different label. 3) We finetune a pretrained language model with a classification head.
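To make this formulation concrete, the following is a minimal sketch of zero-shot prompting with a masked language model for single-token verbalizers, using the Hugging Face transformers API. The pattern and verbalizer strings are illustrative placeholders, not the exact templates used in the paper, and multi-token verbalizers would additionally require autoregressive scoring.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative pattern and verbalizer for binary sentiment (placeholders, not the paper's templates).
PATTERN = "{text} All in all, it was {mask}."
VERBALIZER = {"positive": "good", "negative": "bad"}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def classify(text: str) -> str:
    """Return arg max_c p(v(c) | t(x)) for single-token verbalizers."""
    prompt = PATTERN.format(text=text, mask=tokenizer.mask_token)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Log-probabilities over the vocabulary at the masked position.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = logits[0, mask_pos].log_softmax(dim=-1).squeeze(0)
    scores = {}
    for label, word in VERBALIZER.items():
        # RoBERTa's BPE distinguishes word-initial tokens, hence the leading space.
        token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + word))[0]
        scores[label] = log_probs[token_id].item()
    return max(scores, key=scores.get)

print(classify("The movie kept me on the edge of my seat."))
```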
However, prompting is known to be sensitive to the choice of the pattern and the verbalizer, yet practitioners are blind when designing them in true zero-shot settings (Jiang et al., 2020; Perez et al., 2021). Connected to that, subtle phenomena like surface form competition (Holtzman et al., 2021) have a large impact on performance. Recent work has tried to mitigate these issues through calibration (Zhao et al., 2021), prompt combination (Schick and Schütze, 2021a; Lester et al., 2021; Zhou et al., 2022) or automatic prompt generation (Shin et al., 2020; Gao et al., 2021). At the same time, there is still no principled understanding of how language models become few-shot learners, with recent work analyzing the role of the pretraining data (Chan et al., 2022) or the input-output mapping of in-context demonstrations (Min et al., 2022).
In this paper, we propose an alternative approach to zero-shot learning that is more flexible and interpretable than prompting, while obtaining stronger results in our experiments. Similar to prompting, our method requires a pretrained language model, a pattern and a verbalizer, in addition to an unlabeled corpus (e.g., the one used for pretraining). As illustrated in Figure 1, our approach works by using the pattern and verbalizer to mine labeled examples from the corpus through regular expressions, and leveraging them as supervision to finetune the pretrained language model. This allows us to naturally combine multiple patterns and verbalizers for each task, while providing a signal to interactively design them by looking at the mined examples. In addition, we show that better results are obtained by filtering the mined examples through prompting.
Experiments in sentiment analysis, topic classification and natural language inference (NLI) confirm the effectiveness of our approach, which outperforms prompting by a large margin when using the exact same verbalizers and comparable patterns. Our results offer a new perspective on how language models can perform downstream tasks in a zero-shot fashion, showing that similar examples often exist in the pretraining corpus, which can be directly retrieved through simple extraction patterns.

Proposed Method
As shown in Figure 1, our method has three steps:

Mine. We first use the pattern and a set of verbalizers to extract labeled examples from the corpus. To that end, we define patterns that are filled with verbalizers and expanded into regular expressions. For instance, the pattern and verbalizer in Figure 1 would extract every sentence following "is good." or "was good." as an example of the positive class, and every sentence following "is bad." or "was bad." as an example of the negative class. In practice, the patterns that we define are comparable to the ones used for prompting, and the verbalizers are exactly the same (see Tables 1 and 2). Appendix A gives more details on how we expand patterns into regular expressions. While prior work in prompting typically uses a single verbalizer per class, our approach allows us to naturally combine examples mined through multiple verbalizers in a single dataset. So as to mitigate class imbalance and keep the mined dataset to a reasonable size, we mine a maximum of 40k examples per class after balancing across the different verbalizers.

Filter. As an optional second step, we explore automatically removing noisy examples from the mined data. To that end, we classify the mined examples using zero-shot prompting, and remove examples for which the predicted and the mined label do not match. This filtering step is reliant on the performance of prompting, so we only remove the 10% of mismatching examples for which zero-shot prompting is the most confident.

Finetune. Finally, we use the mined dataset to finetune a pretrained language model in the standard supervised fashion (Devlin et al., 2019), learning a new classification head.
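To make the filter step concrete, the sketch below implements the heuristic described above under one reading of it: among the examples where prompting disagrees with the mined label, only the fraction with the highest prompting confidence is removed. The prompt_predict callable and the example dictionaries are illustrative assumptions, not part of the released method.

```python
from typing import Callable, Dict, List, Tuple

def filter_mined_examples(
    examples: List[Dict],                                  # each: {"text": ..., "mined_label": ...}
    prompt_predict: Callable[[str], Tuple[str, float]],    # returns (predicted label, confidence)
    remove_fraction: float = 0.10,
) -> List[Dict]:
    """Keep all examples except the most confidently contradicted mismatches."""
    kept, mismatched = [], []
    for ex in examples:
        pred, conf = prompt_predict(ex["text"])
        if pred == ex["mined_label"]:
            kept.append(ex)
        else:
            mismatched.append((conf, ex))
    # Drop only the top `remove_fraction` of mismatches, ranked by prompting confidence.
    mismatched.sort(key=lambda pair: pair[0], reverse=True)
    n_remove = int(len(mismatched) * remove_fraction)
    kept.extend(ex for _, ex in mismatched[n_remove:])
    return kept
```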
Approaches. We compare the following methods in our experiments, using RoBERTa-base (Liu et al., 2019) as the pretrained model in all cases:

• Full-shot fine-tuning: We finetune RoBERTa on the original training set, adding a new classification head. We train for 3 epochs with a batch size of 32. All the other hyperparameters follow Liu et al. (2019). Refer to Appendix B for more details.
• Zero-shot prompting: Standard prompting, as described in §1. Multi-token verbalizer probabilities are calculated autoregressively, picking the most likely token at each step (Schick and Schütze, 2021c). We report results using both a single verbalizer per class, as is common in prior work, and multiple verbalizers per class, which is more comparable to our approach. For the latter, we combine the probabilities of each verbalizer by averaging.

• Zero-shot mining: Our proposed method, described in §2. For the mining step, we use the first 100 shards of the C4 corpus (Raffel et al., 2020), which cover 9.8% of the data. For the filtering step, we use single-verbalizer prompting to filter 10% of the mislabeled examples.
For the fine-tuning step, we use the same settings as in the full-shot setup, except that we train for 5,000 steps with a dropout probability of 0.4. To mitigate class imbalance, we form batches by first sampling the class for each instance from the uniform distribution, and then picking a random example from the mined data belonging to that class.
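A minimal sketch of this class-balanced batch sampler, assuming the mined examples have already been grouped by class; the names and exact data layout are illustrative.

```python
import random
from typing import Dict, List

def balanced_batches(examples_by_class: Dict[str, List[dict]],
                     batch_size: int, num_steps: int):
    """For every batch slot, draw a class uniformly at random, then a random
    mined example of that class (sampling with replacement)."""
    classes = list(examples_by_class)
    for _ in range(num_steps):
        yield [random.choice(examples_by_class[random.choice(classes)])
               for _ in range(batch_size)]
```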
Patterns and verbalizers. We use comparable patterns for prompting and mining with the exact same verbalizers, which we report in Tables 1 and 2. These were designed without any experiment, simulating a zero-shot setting. We design our patterns to capture sentences following a verbalizer, rather than sentences containing the verbalizer, as the resulting dataset would otherwise be trivial (solvable by detecting the presence of certain words).

Results and Analysis
We next discuss our main findings and report additional results in Appendix C.
Main results. We report our main results in Table 3. Our method outperforms prompting by 8.8 points on average, and the improvements are consistent across all tasks.
Effect of patterns and verbalizers. Table 4 reports sentiment results using different verbalizers. Consistent with prior work, we find that both prompting and mining are highly sensitive to the choice of the verbalizer, yet combining them all roughly matches the results of the best performing one. As shown in Table 5, using different patterns has an even larger impact. Interestingly, patterns and verbalizers that do well with one approach do not necessarily do well with the other.

Effect of filtering. Table 6 reports additional results using the full-shot systems for filtering, or not using any filtering at all. We find that prompting-based filtering brings modest but consistent improvements across all types of tasks. We compare this to filtering out all examples with mismatching labels using the full-shot model, which results in much larger gains and approaches the performance of the fully supervised system for sentiment and topic classification tasks. This can be seen as an upper limit of what could be reached with perfect filtering, which leaves ample room to improve our approach by focusing on the filtering step alone.
Qualitative analysis. We manually assessed 20 mined examples for sentiment analysis and report some representative instances in Table 7. We find that the mined data covers many domains, like finance and technology. Most examples are correct (#1, #3), but there are also instances with wrong labels (#4). In addition, we find that 40% of the analyzed examples show weak or neutral sentiment (#2). The impact of such irrelevant examples is unclear and worthy of future study.

Related work
Recent work in zero-shot learning has explored a similar generate-filter-finetune approach, but using large language models instead of mining to generate training data (Schick and Schütze, 2021b; Liu et al., 2022; Meng et al., 2022; Ye et al., 2022).
Mining-based approaches have a long tradition in information extraction (Riloff, 1996; Riloff and Jones, 1999). However, to the best of our knowledge, we are the first to apply them to zero-shot learning as an alternative to prompting. Instead of mining examples for the target task, Bansal et al. (2020) define task-agnostic pretraining objectives on unlabeled corpora. Closer to our work, Meng et al. (2020) mask label-indicative words in an unlabeled corpus, and train a model to predict their corresponding label. Concurrent to our work, Han and Tsvetkov (2022) try locating a subset of the pretraining data that supports prompting in specific tasks. Finally, Razeghi et al. (2022) show a strong correlation between performance on specific instances and the frequency of terms from those instances in the pretraining data.

Conclusions
In this work, we have shown that mining-based zero-shot learning outperforms prompting. Moreover, our approach shows headroom for further improvement by exploring better filtering techniques. The flexibility of our approach enables additional directions like domain filtering, bootstrapping, and interactive pattern/verbalizer design, where practitioners would inspect a few mined examples and refine their patterns until they are satisfied. In addition, our method can serve as a partial explanation for why prompting works, showing that task-relevant examples are often present in the pretraining corpus in an explicit form, to the extent that they can be directly mined through simple regular expressions. Nevertheless, we believe that there can be other factors involved, as evidenced by the best patterns and verbalizers being different for mining and prompting, and we consider delving deeper into the relation between pretraining data and prompting performance an interesting direction for future work.

Limitations
Developing zero-shot methods in a rigorous manner is challenging: the strict zero-shot scenario does not allow using annotated data except for the final evaluation, yet it is difficult to make development decisions without any signal. We decided to use AG News and SST-2 during development without any exhaustive hyperparameter exploration, and evaluated blindly on the rest of the tasks. At the same time, we designed all patterns and verbalizers without any experiments, based solely on our own intuition. We believe that the comparison between prompting and mining is fair, as we used comparable patterns with the exact same verbalizers and pretrained model. However, it is possible that our patterns, verbalizers and/or hyperparameters are suboptimal, and better results could be obtained with either prompting or mining using other configurations.
An important limitation of our approach is that it can be difficult to design extraction patterns for certain tasks, like multiple-choice questions. However, prompting is known to suffer from a similar limitation, with certain tasks like WiC being difficult to formulate as language modeling and obtaining random-chance performance (Brown et al., 2020). Different from prompting, our approach requires an intermediate step after pretraining to mine data and finetune the model, which takes 2-7 hours using a single Nvidia Titan RTX GPU and 4 Intel Xeon CPUs. However, inference is as fast as or faster than prompting, as our approach does not incur any overhead for multi-token and multi-verbalizer setups.

A Pattern expansion for mining
For each class, examples are mined by filling in the pattern with the verbalizer and extracting sentences that match the filled-in pattern. The process of expanding the patterns into regular expressions is as follows. First, we replace {VERBALIZER} with a capturing group containing all verbalizers separated by the alternation operator |. For example, the verbalizer set good, great, awesome is expanded into (good|great|awesome). Then, we replace the keywords described in Table 8 with the corresponding regular expressions. The result is a regular expression containing capturing groups for extracting sentences in a case-insensitive fashion.
Note that we use a simplistic sentence definition in order to keep the regex manageable. Since we assume that a period always ends a sentence, abbreviations are mistakenly interpreted as multiple sentences (e.g., "U.S.A." contains 3 sentences). To address this, we filter out mined sentences shorter than 4 characters.
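The following is a minimal sketch of the expansion and mining procedure, with an illustrative sentiment pattern; the keyword-to-regex mapping is a stand-in for the actual expansions in Table 8, and the minimum-length filter follows the note above.

```python
import re
from typing import Dict, List, Tuple

# Illustrative stand-in for the {INPUT} keyword in Table 8: a crude "sentence"
# of at least 4 characters before its terminating punctuation.
SENTENCE = r"([^.!?\n]{4,}[.!?])"

def expand_pattern(pattern: str, verbalizers: List[str]) -> re.Pattern:
    """Fill {VERBALIZER} with an alternation group and compile case-insensitively."""
    regex = pattern.replace("{VERBALIZER}", "(" + "|".join(map(re.escape, verbalizers)) + ")")
    regex = regex.replace("{INPUT}", SENTENCE)
    return re.compile(regex, re.IGNORECASE)

def mine(corpus: List[str], pattern: str,
         verbalizers_per_class: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (sentence, label) pairs whose context matches the filled-in pattern."""
    examples = []
    for label, verbalizers in verbalizers_per_class.items():
        regex = expand_pattern(pattern, verbalizers)
        for document in corpus:
            for match in regex.finditer(document):
                # group(2) is the {INPUT} group for this particular pattern.
                examples.append((match.group(2).strip(), label))
    return examples

# Illustrative pattern: label the sentence that follows "is/was <verbalizer>."
pattern = r"(?:is|was) {VERBALIZER}\. {INPUT}"
verbalizers = {"positive": ["good", "great", "awesome"],
               "negative": ["bad", "awful", "terrible"]}
corpus = ["The screen is great. The battery lasts for days and charging is quick."]
print(mine(corpus, pattern, verbalizers))
```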

B Additional experimental details
Patterns and verbalizers. For each category of tasks we use the same mining pattern, as shown in Table 1. The complete list of verbalizers for each task is given in Table 9. Tasks with the same classes share the same verbalizers; this means that all sentiment tasks share one set of verbalizers and all NLI tasks share another. Each topic classification task, however, has a unique set of verbalizers. Note that while SNLI and MNLI (3-way NLI) have the same verbalizers as RTE and QNLI (2-way NLI), the mined datasets do differ, since 2-way NLI does not include a neutral class.
Hyperparameters. Table 10 shows the hyperparameters used for finetuning the RoBERTa-base model. All the other hyperparameters and the classification head architecture follow Liu et al. (2019). We have two fine-tuning configurations, one for fine-tuning in the full-shot setting and one for zero-shot fine-tuning on the mined dataset. These configurations differ only in the maximum number of steps, the dropout rate, and the batch sampler.
Datasets. We use the Huggingface datasets library (Lhoest et al., 2021) for loading all evaluation datasets without any additional processing, except for MR, which is detokenized using the Moses scripts. We evaluate on the test set, falling back to the validation set for SST-2, MNLI, RTE and QNLI.

C Additional results
Complete results for full-shot, prompting and mining are combined in Table 11. Results showing the effect of verbalizer and pattern choice on binary sentiment classification are presented in Table 12 and Table 13, respectively. As explained in the main text, development experiments were only conducted on AG News and SST-2. On these tasks, we found that high regularization partially mitigates the overfitting caused by the misalignment between the mined dataset and the real dataset. However, this high regularization shows mixed results on the non-development tasks. For full transparency, we compare these performance differences in Table 14, but in the main text we stick to the original setup with high dropout to be faithful to the rigorous zero-shot scenario.
For multi-verbalizer prompting, we combine the probabilities of each verbalizer with an aggregation function. Results using the average, the max and the sum are shown in Table 15.
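For reference, a minimal sketch of this aggregation step; the array layout (one row of class probabilities per verbalizer) is an assumption for illustration.

```python
import numpy as np

def combine_verbalizer_probs(probs: np.ndarray, how: str = "average") -> np.ndarray:
    """Combine per-verbalizer class probabilities of shape (n_verbalizers, n_classes)."""
    if how == "average":
        return probs.mean(axis=0)
    if how == "max":
        return probs.max(axis=0)
    if how == "sum":
        return probs.sum(axis=0)
    raise ValueError(f"unknown aggregation: {how}")
```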
In Table 16 we show the agreement between the mined labels and the labels according to the filtering method, which in our experiments is either a full-shot finetuned model or single-verbalizer prompting.
Table 17 and Table 18 show a random sample of examples from the mined training dataset for binary sentiment analysis and NLI, respectively. In the main text, Table 7 shows a representative selection of examples for sentiment analysis. These examples were manually picked from the random sample in Table 17.

Table 13: Performance for three different templates on sentiment tasks, comparing prompting and mining without filtering. Additionally, we show standard deviations over three seeds for the mining approach. The verbalizer column shows the verbalizer for the positive and the negative class, respectively.

Table 1: Patterns. {VERBALIZER} is replaced with the verbalizers in Table 2. For mining, *. captures everything up to a sentence boundary, and {INPUT}, {INPUT:HYP} and {INPUT:PREM} capture a single sentence.

Table 2: Verbalizers for sentiment classification and NLI. See Table 9 for verbalizers used in topic classification. When using a single verbalizer, we choose the one underlined. Multi-token verbalizers are in italic.

Table 3: Main results (accuracy). All systems are based on RoBERTa-base, and all zero-shot systems use comparable patterns (see Table 1). We report average accuracy across 3 runs for all systems except prompting. w/ multi verb.: prompting with different sets of verbalizers (Table 9) and averaging the probabilities.

Table 4: Average sentiment accuracy using different verbalizers. We report mining results without filtering. More detailed results are provided in Table 12.

Table 5: Average sentiment accuracy using different patterns and verbalizers. We report mining results without filtering (more details are provided in Table 13).

Table 6: Filtering results (average accuracy). †: uses mined data for training and another supervised classifier as the filter. This is not a zero-shot setting and serves as an upper limit for the results using a perfect filter. More detailed results are provided in Table 11.

Table 7: Mined examples for sentiment analysis. See more examples in Table 17 and mined NLI examples in Table 18.
# Lbl Mined example
1 Pos. Do you have an idea of how broad your vocal range was?
2 Pos. Once home, we began priming.
3 Neg. People in Wall Street and other financial services firms should have paid more attention to the data.
4 Neg. So I bought this unit, which said it had the same technical features as the other brand, such as number of channels etc, and this one performed amazing!!

Table 8: Keywords that compile into regular expressions. These keywords are used in the mining patterns and verbalizers.

Table 9: Verbalizers. When using a single verbalizer, we choose the one underlined. In the multi-verbalizer setting, we use all listed verbalizers. Sentiment includes Amazon, IMDB, MR, SST-2 and Yelp; NLI includes MNLI, QNLI, RTE and SNLI. Multi-token verbalizers are in italic.

Table 10: Hyperparameters for full-shot finetuning and zero-shot finetuning with the mined dataset.