Issues with Entailment-based Zero-shot Text Classification

The general format of natural language inference (NLI) makes it tempting to use for zero-shot text classification: any target label is cast into a hypothesis sentence, and the model verifies whether the input entails it, promising generic classification over any specified label space. In this opinion piece, we point out a few overlooked issues that are yet to be discussed in this line of work. We observe huge variance across different classification datasets amongst standard BERT-based NLI models, and surprisingly find that pre-trained BERT without any fine-tuning can yield competitive performance against BERT fine-tuned for NLI. Concerned that these models rely heavily on spurious lexical patterns for prediction, we also experiment with preliminary approaches for more robust NLI, but the results are in general negative. Our observations reveal implicit but challenging difficulties in entailment-based zero-shot text classification.


Introduction
Natural language inference (NLI; Bowman et al., 2015), also known as recognizing textual entailment (RTE; Condoravdi et al., 2003; Dagan et al., 2005), is normally formatted as the task of determining whether a premise sentence semantically entails a hypothesis sentence. The generality of this task format has prompted recent studies to apply NLI models to various downstream applications (Poliak et al., 2018) and, more recently, to text classification (Yin et al., 2019, 2020), making them generally applicable solutions alongside similar attempts to build a universal framework for various NLP tasks (Kumar et al., 2016; Raffel et al., 2020, inter alia). Text classification is then reduced to textual entailment by setting the input sentence as the premise and casting each candidate label into a hypothesis sentence using pre-defined templates or lexical definitions from WordNet. Once any pre-trained NLI model is at hand, zero-shot text classification under any specified label space is enabled for free, without the need to collect annotated data. With contextualized representations based on pre-trained language models such as BERT (Devlin et al., 2019), NLI performance has been drastically improved, and promising empirical results have been shown on text classification benchmarks spanning topic, emotion, and situation classification, outperforming earlier standard approaches (Chang et al., 2008) or simple scoring schemes derived from distributional similarity (Mikolov et al., 2013).
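To make this reduction concrete, the following is a minimal sketch of entailment-based zero-shot classification, assuming a generic HuggingFace NLI checkpoint; the checkpoint path and the template are illustrative stand-ins, not the exact artifacts used by Yin et al. (2019).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute an actual BERT checkpoint fine-tuned for NLI.
model_name = "path/to/bert-nli-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def zero_shot_classify(text, labels, template="This text is about {}."):
    """Score each label's hypothesis against the input premise and
    return the label with the highest entailment probability."""
    premises = [text] * len(labels)
    hypotheses = [template.format(label) for label in labels]
    inputs = tokenizer(premises, hypotheses, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # The index of the "entailment" class varies across checkpoints;
    # read it from the config rather than hard-coding it.
    entail_idx = model.config.label2id.get("entailment", 0)
    probs = logits.softmax(dim=-1)[:, entail_idx]
    return labels[probs.argmax().item()]

print(zero_shot_classify("The Lakers won by 20 points last night.",
                         ["sports", "business", "politics"]))
```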
However, such generality is conceptually at odds with the specificity of text classification in many practical scenarios. In this opinion piece, we conduct an extended analysis of the recent attempts (Yin et al., 2019) and point out some implicit issues in entailment-based zero-shot text classification that are overlooked in this line of work. We experiment with additional classification datasets and observe huge variance across them amongst standard BERT-based NLI models. More surprisingly, we find that raw BERT models without fine-tuning can sometimes yield more competitive results. We also experiment with preliminary approaches for improving the robustness of NLI models, only to find generally negative results. Our observations reveal implicit but massive difficulties in building a successful general-purpose zero-shot text classifier based on textual entailment models.

Our Investigation and Implied Issues
We re-examine the earlier study (Yin et al., 2019) with extended analysis to help establish a better understanding of zero-shot text classification based on textual entailment. Our focus is to check how well models pre-trained for NLI generalize to the prediction of unseen categories, which is the major target of zero-shot classification. We do not study the setting where the test set also includes labels seen in training, commonly phrased as generalized zero-shot learning (Xian et al., 2018) and referred to as the label-partially-unseen setting by Yin et al. (2019); that setting strongly assumes that a bunch of in-domain data for a number of classes is already available.1 Additionally, we extend our experiments with the test sets from the following datasets:

Snips: A popular dataset2 for intent detection collected from the Snips personal voice assistant (Coucke et al., 2018), with seven intent labels.
AG's News: To further study the models on topic classification in a different genre, we additionally use the English news data from Zhang et al. (2015), which consists of four types of articles: World, Sports, Business, and Sci/Tech.

SST-2: The Stanford Sentiment Treebank dataset3 processed by Socher et al. (2013) for sentiment polarity classification with binary labels (positive and negative).
1 Another reason for not studying this setting is that the development and test splits in (Yin et al., 2019) share the same label space, which makes them flawed for any claim about the performance on "unseen labels".
2 https://github.com/snipsco/snips-nlu
3 For SST-2 we follow Zhang et al. (2021) and Gao et al. (2021) in using the development set from GLUE for testing.

Experimented systems
To study entailment-based approaches, we use the models released by Yin et al. (2019), which are bert-base-uncased models trained on GLUE RTE (Dagan et al., 2005; Wang et al., 2019b), MNLI (Williams et al., 2018), and FEVER (Thorne et al., 2018), respectively. We reuse the same scheme for mapping labels into hypotheses using templates and WordNet definitions for all datasets4, as well as the same mechanism for producing final predictions. We leave more implementation details to the Appendix.
For reference, we keep reporting results from the following baselines used by Yin et al. (2019):

• Majority: Output the most frequent label.
• Word2Vec: Vectorize the input and labels with averaged word embeddings, and output the label with the maximum cosine similarity.
• ESA: Represent the text and labels in the Wikipedia concept vector space, using the implementation5 from Chang et al. (2008).
Moreover, given the obvious variance in performance among models trained on different NLI datasets, we also check how much performance degrades when no NLI data is given for fine-tuning at all. This corresponds to naively using a raw BERT model that has been pre-trained for next sentence prediction (NSP). For consistency, we use the same premises and hypotheses, built from label names and templates, to formulate the sentence pair classification. Since NSP does not predict a directional semantic entailment, we also try a variant with all pairs reversed, i.e., placing all hypothesis sentences ahead of the premises as input, denoted NSP(Reverse).
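Below is a minimal sketch of this NSP baseline and its reversed variant, assuming vanilla bert-base-uncased from HuggingFace; the template is an illustrative placeholder, not necessarily the exact one in our experiments.

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def nsp_scores(text, labels, template="This text is about {}.", reverse=False):
    """Score each label hypothesis as a candidate next (or previous) sentence."""
    hypotheses = [template.format(label) for label in labels]
    if reverse:  # NSP(Reverse): hypothesis placed ahead of the input text
        pairs = [(h, text) for h in hypotheses]
    else:
        pairs = [(text, h) for h in hypotheses]
    inputs = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # In HuggingFace's NSP head, index 0 is the "is next sentence" class.
    return logits.softmax(dim=-1)[:, 0]

scores = nsp_scores("play the god that failed on vimeo",
                    ["PlayMusic", "AddToPlaylist"], reverse=True)
```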

Results and further analysis
The results from all systems on different datasets are displayed in Table 1, including an additional group of MNLI results, as we found an even better run overall in our experiments. Several interesting observations emerge from our extended experiments and analysis. The big differences across NLI datasets drive us to try a raw BERT without fine-tuning on any NLI data, i.e., relying merely on NSP pre-training for sentence pair classification. The results, shown in the bottom two rows of Table 1, turn out to be surprisingly strong, especially on topic classification, intent classification, and binary sentiment classification. We conjecture that the raw BERT model has already acquired a certain ability for topic distinction and sentiment polarity from the construction of positive and negative sentence pairs in NSP pre-training, which requires detecting pairwise coherence. In this way, NSP could serve as a non-trivial, strong alternative baseline for zero-shot text classification scenarios where the target labels are semantically more concrete (e.g., topics) or more frequent (e.g., words expressing sentiment). In such scenarios, fine-tuning on limited NLI data could weaken the semantic coherence acquired by raw BERT pre-trained on generic-domain corpora, especially since fine-tuned models have been shown to exploit spurious lexical co-occurrence features in many similar sentence pair classification settings (Feng et al., 2019; Niven and Kao, 2019), possibly due to the inherent lexical bias of current NLI datasets collected from crowd workers.6 Readers curious about more details on this problem can refer to our qualitative analysis in the Appendix, which could hopefully help establish a slightly better sense of the behavioral difference introduced by NLI fine-tuning.

6 Some readers might guess that other NLI datasets collected via a more careful process (Jiang and de Marneffe, 2019; Eisenschlos et al., 2021) might partially mitigate the bias from crowdsourced annotation, but this does not mean that such better-intended datasets are free from statistically biased lexical distributions with coincidental co-occurrences that strong data-fitting models can exploit during fine-tuning (Geirhos et al., 2020; Du et al., 2021). Our additional results described in the Appendix do not seem promising in this direction towards better NLI data.
On the other hand, fine-tuning on NLI data might seem marginally helpful for more abstract cases such as emotion and situation typing, but the performance there is in fact pathetically disappointing across all systems.

How stable are these NLI models?
Apart from the obvious differences caused by different training data, there is a more serious underlying concern: the discrepancy between the training task (NLI) and the target usage (classification). The gap in task format (and hence data distribution) naturally raises a question: do NLI models with similar in-domain performance generalize similarly for text classification?
We train NLI models on the largest MNLI dataset with varied hyperparameter settings and random seeds, and keep the models that achieve similarly strong in-domain generalization as measured by early-stopping dev set performance. Results are listed in Table 2, where the absolute differences between the worst and the best runs are large, especially on topic and intent classification. We observe even worse trends on the other, smaller NLI datasets (see Appendix). These results are consistent with recent studies within the scope of NLI reporting that BERT instances achieving similar performance on standard NLI datasets can vary hugely in out-of-distribution generalization or linguistic stress testing (McCoy et al., 2020; Zhou et al., 2020; Geiger et al., 2020), and they provide another instance of the underspecification problem in modern machine learning (D'Amour et al., 2020).
As a verification, we also tune the models on development sets that better characterize the generalization behavior of zero-shot classification. Results are shown in Table 3, where we clearly see more stable generalization performance. However, this requires that a certain amount of annotated data for the targeted classification already exists, making NLI models difficult to apply in practice. The results in this part reveal that text classification via NLI demands out-of-distribution generalization, a property that current NLI models rarely possess, making them susceptible to huge instability.

Towards more robust NLI training

We experiment with three schemes on the MNLI data to see whether they lead to better generalization in zero-shot classification: (1) data augmentation (DA) following Min et al. (2020), (2) example reweighting (Reweight), and (3) bias-product ensembling (BiasProduct), with the latter two built on the word-overlap bias model of Clark et al. (2019); see Appendix A.4 for details. There exist additional solutions with richer details, such as multi-task learning (Tu et al., 2020), where proper auxiliary tasks could be identified to improve robustness; we plan to explore this line further in a more extensive future study.
The results are shown in Table 5. All three debiasing methods improve NLI performance on the HANS dataset (McCoy et al., 2019) for robustness testing, indicating that the debiased models overcome the word-overlap heuristics to some extent. In general, however, we do not observe any real improvement beyond negligible gains on the emotion and situation datasets, where the original performance is pathetically low.
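For concreteness, here is a minimal PyTorch sketch of the two ensemble-based debiasing losses, following the general recipes of Clark et al. (2019); the exact training integration in our experiments may differ, and `bias_log_probs` is assumed to come from a frozen, pre-computed bias model.

```python
import torch
import torch.nn.functional as F

def bias_product_loss(main_logits, bias_log_probs, labels):
    """Product of experts: combine the main model and the bias model in
    log space, so the main model is not rewarded for re-learning the bias."""
    combined = F.log_softmax(main_logits, dim=-1) + bias_log_probs
    return F.cross_entropy(combined, labels)  # re-normalizes internally

def reweight_loss(main_logits, bias_log_probs, labels):
    """Down-weight examples that the bias model already gets right."""
    with torch.no_grad():
        p_bias_correct = bias_log_probs.exp().gather(
            1, labels.unsqueeze(1)).squeeze(1)
    per_example = F.cross_entropy(main_logits, labels, reduction="none")
    return ((1.0 - p_bias_correct) * per_example).mean()
```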

Conclusion and Discussion
We investigate entailment-based zero-shot text classification with extended analysis, uncovering the following overlooked issues:

• Raw BERT models trained for next sentence prediction are surprisingly strong baselines, and NLI fine-tuning does not bring performance gains on many classification datasets.
• Large variance across different classification scenarios and instability across different runs, still requiring annotated data (at least for validation) to stabilize generalization performance.
• NLI models usually rely heavily on shallow lexical patterns, which hampers the generalization required by text classification, and current, more robust NLI methods might not help.
Our observations reveal implicit but massive difficulties in building a usable zero-shot text classifier based on textual entailment models. Given the difficulty of collecting NLI data that supports out-of-domain generalization or transfer learning (Bowman et al., 2020), we question the feasibility of this setup at the current stage of language technology. Before significant progress in language understanding and reasoning, it seems more promising to consider alternative schemes built on explicit external knowledge (Zellers and Choi, 2017; Rios and Kavuluru, 2018; Zhang et al., 2019) or more careful usage of pre-trained models that have hopefully captured more comprehensive semantic coverage and better compositionality from large corpora or grounded texts (Meng et al., 2020; Brown et al., 2020; Radford et al., 2021).
This study also implies the huge difficulty of benchmarking zero-shot text classification without further restrictions on the task setting. The three datasets used by Yin et al. (2019) were originally intended for diverse coverage but are not sufficient to draw consistent conclusions, as we have shown. We suggest that future studies on zero-shot text classification either conduct experiments over even more diverse classification scenarios to verify any claimed generality, or directly focus on more specific task settings and verify claims within a smaller but clearer scope, such as zero-shot intent classification or zero-shot situation typing, for more reliable results with less instability, perhaps based on more carefully curated data (Rogers, 2021).

A.1 Additional Experimental Details
Templates for generating hypotheses For the Yahoo, Emotion, and Situation datasets, we followed Yin et al. (2019) and explored both the label names and the WordNet definitions, accompanied by a template10, to convert labels into hypotheses for the entailment-based models. When applying NSP, we only used label names to generate hypotheses, as we did not observe real improvement from using WordNet definitions in our preliminary experiments. For AGNews, SST-2, and Snips, we simply used the label names to fill the templates. The templates we used are given in Table A.1.
Other implementation details For all experiments, we train bert-base-uncased models using code from the HuggingFace library (Wolf et al., 2019). We use the same prediction strategy as Yin et al. (2019): we pick the label with the maximal probability in single-label scenarios, while choosing all labels with a "next sentence" decision in multi-label cases for both the NSP and NSP(Reverse) baselines.
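This decision rule can be summarized in a few lines. The sketch below is our reading of the strategy described above; the 0.5 threshold is an assumption that interprets a binary "next sentence"/"entailment" decision as picking the majority class, not a constant stated by Yin et al. (2019).

```python
def predict(labels, positive_probs, multi_label=False):
    """Single-label: argmax over labels. Multi-label: keep every label
    whose positive-class probability wins the binary decision (> 0.5)."""
    if not multi_label:
        best = max(range(len(labels)), key=lambda i: positive_probs[i])
        return [labels[best]]
    return [l for l, p in zip(labels, positive_probs) if p > 0.5]
```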
Label spaces of classification The labels of each dataset we used are listed in Table A.2.

A.2 Qualitative Analysis

Table 1 shows that NSP(Reverse) achieves better performance than NSP on several datasets. This could be related to the templates we used for generating previous or next sentences. For example, for the input "play the god that failed on vimeo" with label "PlayMusic", NSP(Reverse) predicts "PlayMusic" while NSP predicts "AddToPlaylist": "I want to play music. play the god that failed on vimeo" is a more natural expression than "play the god that failed on vimeo. I want to play music".

Among the entailment models, we find the RTE-based model performs best on the situation dataset, whose majority class is the "none" label. As shown in Figure A.1, the RTE-based model performs best on the "none" label. In fact, if we calculate the average number of predicted labels per instance, NSP, NSP(Reverse), and FEVER predict about 6.2 to 8.3 labels per instance, while RTE and MNLI predict about 1, which is closer to the average number of gold labels per instance. This implies NSP is not good at identifying the "none" label, since the condition for predicting "entailment" (a premise entails its hypothesis) is stricter than that for predicting a "next sentence" label. For SST-2, we observe that all three entailment models tend to mislabel "negative" sentences as "positive". This may be attributed to the label word distribution in the NLI datasets: the keyword "great" for the positive label occurs much more frequently than the keyword "terrible" for the negative label in all three NLI datasets.

Case study To get a better understanding of the NLI models' behavior, we carry out a case study on Snips. We use the Integrated Gradients method (Sundararajan et al., 2017) to attribute the BERT model's output score for the entailment class to each input token11. Several examples are shown in Table A.3. We find that the NLI models sometimes rely on spurious patterns for prediction. In the first example, the model fine-tuned on FEVER assigns a high negative attribution score to the word "zero" and makes a wrong prediction; if we replace "zero" with other numbers, the model changes its prediction and correctly predicts the "RateBook" label. These examples suggest that the model trained on the FEVER dataset has learned spurious correlations between the "not entailment" label and the occurrence of the word "zero"12. Even if these superficial patterns are not the models' main mechanism for prediction, they still reveal the models' fragility and could be an important factor in their failure in zero-shot scenarios.
The other two groups of cases show another problem: current NLI models only predict the "entailment" label when the premise strictly entails its hypothesis, a problem definition that simply differs from the zero-shot test tasks. For example, in the last group, the model trained on MNLI outputs a low probability for entailment since "restaurant" cannot be directly inferred from the premise sentence; if we change "restaurant" to "place", the model confidently predicts "entailment".
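For readers who want to reproduce this kind of attribution, the following is a sketch of the setup using the Captum library, which is our assumed tooling (any Integrated Gradients implementation would do). The baseline construction follows footnote 11; the checkpoint name and the hypothesis text are stand-ins for the actual fine-tuned NLI model and templates.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # stand-in for a fine-tuned NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def forward_entailment(input_ids, attention_mask, entail_idx=0):
    """Return the entailment-class probability, one scalar per example."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits.softmax(dim=-1)[:, entail_idx]

enc = tokenizer("play the god that failed on vimeo",
                "this text expresses the intent of rating a book",
                return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

# Baseline per footnote 11: keep [CLS]/[SEP], replace everything else with [PAD].
special = (input_ids == tokenizer.cls_token_id) | (input_ids == tokenizer.sep_token_id)
baseline_ids = torch.where(special, input_ids,
                           torch.full_like(input_ids, tokenizer.pad_token_id))

lig = LayerIntegratedGradients(forward_entailment, model.bert.embeddings)
attributions = lig.attribute(inputs=input_ids, baselines=baseline_ids,
                             additional_forward_args=(attention_mask,))
token_scores = attributions.sum(dim=-1).squeeze(0)  # one score per input token
```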
Error cases We also show some additional examples in Table A.4, from which we might naturally conjecture that the entailment models rely on spurious lexical features for prediction.
Impact of template choice How to properly choose templates is another issue when utilizing NLI for zero-shot classification. As shown in Table A.5, different templates that all seem meaningful to humans can have large performance variance on SST-2.
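A template-sensitivity check can be scripted as below, reusing the zero_shot_classify sketch from the Introduction; the templates listed and the dev_set variable are illustrative assumptions, not the exact contents of Table A.5.

```python
# Evaluate the same model under several plausible sentiment templates and
# compare accuracies; dev_set is an assumed list of (text, gold-label) pairs.
templates = [
    "It was {}.",
    "The sentiment of this text is {}.",
    "This text expresses a {} opinion.",
]
for template in templates:
    correct = sum(
        zero_shot_classify(text, ["positive", "negative"], template) == gold
        for text, gold in dev_set
    )
    print(f"{template!r}: acc = {correct / len(dev_set):.3f}")
```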

A.3 Details for Stability Experiments
Details for training settings For the MNLI dataset, we merge the neutral and contradiction labels into a not-entailment label, following Yin et al. (2019). We choose hyperparameters randomly for different runs: the learning rate from {2e-5, 3e-5, 5e-5}, the number of training epochs from {3, 4, 5}, and a randomly set random seed.

11 As the baseline for the attribution method, we use inputs in which all tokens except [SEP] and [CLS] are replaced with the pad token.
12 There are 407 premise-hypothesis pairs containing the word "zero" with a REFUTES label, but only 122 such pairs with a SUPPORTS label.
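A sketch of this random sweep, under the grid stated in the training-settings paragraph above (the fine-tuning loop itself is omitted):

```python
import random

def sample_run_config():
    """Sample one training configuration from the stated grid,
    with a fresh random seed per run."""
    return {
        "learning_rate": random.choice([2e-5, 3e-5, 5e-5]),
        "num_train_epochs": random.choice([3, 4, 5]),
        "seed": random.randrange(2**31),
    }

configs = [sample_run_config() for _ in range(5)]  # e.g., five runs as in A.3
```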
Results for training on RTE As shown in Table A.6, performance across different runs has large variance on both the RTE dev set and the text classification datasets, owing to RTE's small size.
Reorganizing dev and test sets for Yahoo and AGNews We reorganize the Yahoo development set provided by Yin et al. (2019) and divide the test set as follows. For the dev set, we preserve the instances with labels in {"Society & Culture", "Health", "Computers & Internet", "Business & Finance", "Family & Relationships"}; we call this new dev set Yahoo-dev. From the original test set, we select only instances whose labels do not appear in the dev set as our new test set, denoted Yahoo-test. During NLI model training, we select the checkpoint by its performance on Yahoo-dev, and we report the variance of five different runs trained on MNLI. We conduct experiments on AGNews in the same way: we use {"World", "Sports"} as seen labels and randomly select 1800 instances per seen label from the training data as our new development set, yielding AGNews-dev and AGNews-test.
A.4 Details of Robust NLI Models

Details for training settings For all the models, we use the same set of hyperparameters: we train with a batch size of 64 and the Adam optimizer with an initial learning rate of 2e-5, and fine-tune the BERT model for 3 epochs. The maximum sequence length is limited to 128.
For the DA (data augmentation) method, we use the most effective strategy in Min et al. (2020), called inversion with a transformed hypothesis. For the bias model used in Reweight and BiasProduct, we use the feature-based word-overlap bias model13 of Clark et al. (2019).

Table A.3: Examples of attribution score visualization. Each example is followed by the model's prediction probability for the entailment class. The "Predict" column shows the model's predicted class with its entailment probability for the input premise text, and the "Gold-Std." column displays the true labels. Red represents a negative attribution score and blue a positive score for the entailment class. Better viewed in color.

Table A.4: Text with gold-standard and predicted labels.
• Gold-standard: Computers&Internet • Prediction: Entertainment&Music (MNLI, RTE), Computers&Internet (FEVER)
Is it possible to rip the music from PS2 games ? No i dont think thats possible because your computer cant understand the data format your ps2 games . Ive also never heard of that being done so id have to say no .
• Gold-standard: Education&Reference • Prediction: Family&Relationships (RTE, FEVER, MNLI)
Who or which company would do the best family history and genealogy research for me in Utah ? I know if you go to the Mormon Church , they can provide tons of answers about your genealogy , and probably suggest a company or person who would do the work for you .
• Gold-standard: BookRestaurant • Prediction: RateBook (RTE, FEVER, MNLI)
book a bakery for lebanese on january 11th 2032

• Gold-standard: BookRestaurant • Prediction: RateBook (RTE, FEVER, MNLI)
book a highly rated place in in in seven years at a pub

• Gold-standard: Negative • Prediction: Positive (RTE, FEVER, MNLI)
outer-space buffs might love this film , but others will find its pleasures intermittent .

Table A.7: HANS accuracy of BERT pre-trained on MNLI and of the different debiasing methods, broken down by the heuristic the example is diagnostic of and by its gold label. L denotes the Lexical Overlap heuristic, S the Subsequence heuristic, and C the Constituent heuristic.