Label Agnostic Pre-training for Zero-shot Text Classification

Conventional approaches to text classification typically assume a fixed set of predefined labels into which a given text can be classified. However, in real-world applications, there exists an effectively infinite label space for describing a given text. In addition, depending on the aspect (sentiment, topic, etc.) and domain of the text (finance, legal, etc.), the interpretation of a label can vary greatly. This makes text classification, particularly in the zero-shot scenario, extremely challenging. In this paper, we investigate the task of zero-shot text classification with the aim of improving the ability of pre-trained language models (PLMs) to generalize to both seen and unseen data across varying aspects and domains. To address this, we introduce two new simple yet effective pre-training strategies, Implicit and Explicit pre-training. These methods inject aspect-level understanding into the model at train time with the goal of conditioning the model to build task-level understanding. To evaluate this, we construct and release UTCD, a new benchmark dataset for evaluating text classification in zero-shot settings. Experimental results on UTCD show that our approach achieves improved zero-shot generalization on a suite of challenging datasets across an array of zero-shot formalizations.


Introduction
Text classification is the process of categorizing text into sets of organized groups, where each set consists of similar content in a well-defined manner (Minaee et al., 2021; Joulin et al., 2016). Supervised approaches have achieved great success in recent years due to the availability of rich training data and the advent of large pre-trained language models such as BERT (Devlin et al., 2018). These conventional approaches typically assume the presence of a pre-defined set of labels to which a given text can be classified. However, in real-world applications, several challenges emerge: 1) The label space is constantly evolving. Over time, new labels emerge and the definition of the label space is continually refined. For example, intent classification systems such as those used in chatbots and dialogue systems constantly introduce new intents as their range of supported features increases. Social networks such as Twitter encounter new and emerging topics on a daily basis from massive amounts of content that need to be classified. Figure 1 shows an example of this emerging label space.
2) The range of applications for text classification is vast. Text classification is pivotal to many different application areas, from sentiment analysis to topic labeling, and is used in a variety of domains such as finance and health. When applied to this conglomeration of uses, it is typically assumed that there exists a comprehensive dataset of well-defined text-label pairs for each use case. However, in many real-world settings, annotated data is either scarce or unavailable entirely. Additionally, the use of dedicated models for each task is impractical due to the additional compute overhead and maintenance, thus making it difficult to scale over time.
Zero-shot learning (ZSL) is aimed at addressing these constraints. Zero-shot learners are models capable of predicting unseen classes. When applied to text classification, these models aim to associate a piece of text with a given label without having been trained on that label. However, despite recent advancements in the capabilities of PLMs, zero-shot models still vastly underperform their supervised counterparts (Pushp and Srivastava, 2017; Puri and Catanzaro, 2019; Brown et al., 2020). As such, this remains an open research problem.
In this paper, we investigate the challenge of reducing the aforementioned performance gap between these zero-shot models and their supervised counterparts on unseen data. We theorize that the poor generalization of these zero-shot models is due to their lack of aspect-level understanding during training. To alleviate this, we introduce two new simple yet effective pre-training strategies, Implicit and Explicit pre-training, which specifically inject aspect-level understanding into the model.
In order to evaluate these strategies, we canvas the range of zero-shot formalizations for enabling zero-shot text classification on PLMs and apply our techniques to each. Additionally, we introduce the Universal Text Classification Dataset (UTCD), a large-scale text classification dataset for evaluating zero-shot text classification. UTCD is a compilation of 18 classification datasets spanning 3 main aspects of Sentiment, Intent/Dialogue, and Topic classification. Our results on UTCD show that by employing both our implicit and explicit pre-training strategies, we achieve improved zero-shot performance on a suite of challenging datasets on which the model was not trained.
Specifically, this paper makes the following contributions: • We introduce Implicit & Explicit pre-training, two new simple yet effective pre-training strategies for improving zero-shot performance.
• We construct and release UTCD, a new benchmark dataset for evaluating text classification systems across a suite of diverse tasks and domains. We release our models and dataset.1
• We conduct a thorough evaluation of various zero-shot text classification formalizations, showing the effectiveness of our training strategies on each, along with the insights gained.

Task Formulation
In this section, we introduce the task of zero-shot text classification and describe a set of formalizations for facilitating the classification of text in a zero-shot manner, i.e., being able to predict unseen labels.
Conventional Text Classification Text classification approaches using PLMs assume the existence of a pre-defined set of labels {y_i}_{i=1}^n where, for a given input sequence X, the model outputs a representation of that sequence as a sequence of hidden states {h_i}_{i=1}^l. Hidden states in the final layer are pooled to a single vector. In the case of BERT (Devlin et al., 2018), the [CLS] token is taken, and a linear softmax layer is added to predict the probability distribution over the label set. For the zero-shot scenario, this approach breaks down since the output class set {y_i}_{i=1}^n is fixed. This prevents the classification of text to new labels unless the model is re-trained with the new label set or a mapping of existing labels to unseen labels is built, both of which are impractical and cumbersome for real-world scenarios.
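The fixed-label setup above can be sketched as follows. This is a minimal illustration with toy weights; the function names and example values are ours, not the paper's:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_fixed(pooled_vec, weight_rows, labels):
    """Linear softmax head over a pooled [CLS] vector. The label set is
    frozen into the weight matrix at train time, so a label added later
    has no weight row to score -- the failure mode described above."""
    logits = [sum(w * x for w, x in zip(row, pooled_vec)) for row in weight_rows]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]
```

Supporting a new label requires a new weight row and re-training, which motivates the zero-shot formalizations that follow.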

Binary Zero-shot Classification
To facilitate zero-shot classification with PLMs, Halder et al. (2020); Pushp and Srivastava (2017); Yin et al. (2019) formulate text classification as a series of binary classification tasks: the model is provided with a concatenation of the class label label(y_i) and the input text, and the output layer generates a binary True/False prediction with a confidence score P. The True-prediction class with the highest confidence is selected as the final prediction, that is, ŷ = argmax_{i ∈ {1...n}} P_i, where n is the number of classes/labels. Such cross-attention (CA) models apply attention layers on the text and labels jointly, which intuitively allows for rich interactions. This architecture is shown in part (a) of Figure 2.
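The selection over per-label True-confidences can be sketched as follows, with a toy token-overlap scorer standing in for the cross-attention model (both helper names are ours, purely illustrative):

```python
def binary_zero_shot(text, labels, score_pair):
    """One binary pass per candidate label; keep the label whose 'True'
    confidence is highest. `score_pair(label, text)` stands in for a
    cross-attention model scoring the concatenated <label, text> input."""
    best_label, best_conf = None, float("-inf")
    for label in labels:
        conf = score_pair(label, text)  # P(True | label, text)
        if conf > best_conf:
            best_label, best_conf = label, conf
    return best_label

def toy_scorer(label, text):
    """Illustrative stand-in: fraction of label tokens present in the text."""
    label_tokens = label.lower().split()
    text_tokens = set(text.lower().split())
    return sum(t in text_tokens for t in label_tokens) / len(label_tokens)
```

Because the label is part of the input rather than a fixed output class, unseen labels can be scored without re-training.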

Dual Encoding Zero-shot Classification
In contrast to cross-attention based architectures, Dual Encoder models (Reimers and Gurevych, 2019; Casanueva et al., 2020a; Clarke et al., 2022) instead focus on learning representations for a given text and label independently. They separately embed the text and label via an encoder Φ and compute pair-wise scores S based on the encoded representations with a distance metric Dist, such as dot-product or cosine similarity:

S(x, y_i) = Dist(Φ(x), Φ(y_i)) (4)

Sentence-BERT (Reimers and Gurevych, 2019) takes PLMs such as BERT and RoBERTa as the base encoder and uses siamese networks to derive sentence embeddings by comparing similarities between sentence pairs, as shown in part (b) of Figure 2. For text classification, this architecture can be used to derive latent representations for a given text and label and classify a sequence x according to:

ŷ = argmax_{i ∈ {1...n}} S(x, y_i) (5)
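Equations (4) and (5) amount to a nearest-label search in embedding space. A minimal sketch with cosine similarity as Dist, using toy vectors in place of encoder outputs (function names are ours):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_encoder_classify(text_vec, label_vecs, labels):
    """Text and labels are encoded independently, so label embeddings can
    be precomputed once and reused for any incoming text -- including
    labels never seen during training."""
    scores = [cosine(text_vec, lv) for lv in label_vecs]
    return labels[max(range(len(labels)), key=lambda i: scores[i])]
```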

Generative Classification
Lastly, the generative formulation of zero-shot text classification uses autoregressive language models by passing in the text and label set as a natural language prompt and training the model to generate the target label token by token. As described in Puri and Catanzaro (2019), we reformulate the text classification problem as a multiple-choice question answering problem. The model is provided with a multiple-choice question description containing each class label in natural language, and is trained to generate the correct answer, as shown in part (c) of Figure 2. The intuition behind this approach is to train the model to use common-sense reasoning to select the most probable description of the text from a provided list of rich natural language classes. Given some input text t, the model is optimized with the next-token prediction language modeling loss.

Method

In this section, we outline the methodology for our Implicit & Explicit pre-training strategies, which allow us to inject aspect-specific knowledge into PLMs to improve generalization to unseen data. We first define the term aspect and outline the gap between the performance of the zero-shot models shown in section 2 on seen data compared to that of unseen data. Lastly, we describe our intuition behind why localization of aspect knowledge helps to bridge this gap.
Aspect Definition In the scope of this work, we define an aspect as the type of task to which a given set of datasets belongs. For example, sentiment is considered an aspect because it cleanly defines a task of understanding the emotion conveyed in a given text. This definition holds true even if the domain of the data changes, e.g., sentiment detection on news data vs. sentiment on social media tweets. In addition to having a clean task definition, we stipulate that the set of labels considered in a given aspect must convey that aspect. E.g., for intent, the label "turn off alarm" conveys that the text describes the intention to do something.

Transfer Learning for Text Classification
The prevailing method for training models to perform classification tasks is to add a linear head on top of a pre-trained language model and fine-tune the entire network on labeled data (Devlin et al., 2018). However, when scaled to multi-task, multi-domain applications, these models suffer from issues such as catastrophic forgetting and conflicting knowledge transfer across tasks (Aribandi et al., 2021; Geva et al., 2021; Clark et al., 2019; Alonso and Plank, 2016). We observe a similar trend in the BERT Seq-CLS rows of Tables 3 and 2: despite the overarching task of text classification remaining the same, when scaling the output space of the classification head to more labels across aspects, we see heavy performance degradation compared to having individual per-dataset models. For example, in Table 3, a multi-dataset BERT sequence classifier performs worse on every benchmark dataset than its single-dataset counterpart. Additionally, for the zero-shot formalizations, we observe the lowest positive transfer on datasets with the lowest level of token overlap between labels seen during training and out-of-domain labels, as shown in Figure 4. We theorize that the reason for this phenomenon is that the model is over-fitting to the specific labels seen during training instead of generalizing to the "aspect".

Implicit Training
In order to introduce aspect specification into our zero-shot models, we take inspiration from T5's (Raffel et al., 2019) text-to-text framework for multi-task generalization. In this framework, the model is fed some text for context and is then asked to produce some output text. As an example, to ask the model to translate the sentence "That is good." from English to German, the model would be fed the sequence "translate English to German: That is good." and would be trained to output "Das ist gut." Similarly, for each aspect (as defined in section 4), we introduce a conditional aspect token to the model input that acts as a context for that specific aspect. As such, in addition to learning the best contextual representation for the <text, label> input pair, the model implicitly learns a higher-level understanding of the underlying aspect. By adding this conditional representation, even as the label space changes, the model is better able to understand the aspect at hand. This is shown in part (b) of Figure 3. In the case of implicit binary zero-shot classification, the model is additionally provided with a concatenation of the aspect token and the output is selected as:

ŷ = argmax_{i ∈ {1...n}} f(label(y_i), aspect(a_{y_i}), x) (7)
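A sketch of how the conditional aspect token might be prepended to the binary-formalization input. The exact token format and separator below are assumptions for illustration, not the paper's specification:

```python
# Assumed token format; the paper does not prescribe this exact surface form.
ASPECT_TOKENS = {"sentiment": "[ASPECT=sentiment]",
                 "intent": "[ASPECT=intent]",
                 "topic": "[ASPECT=topic]"}

def implicit_input(text, label, aspect):
    """Build the model input for implicit training: the aspect token is
    concatenated with the label and text, giving the model an explicit
    context signal that survives changes to the label space."""
    return f"{ASPECT_TOKENS[aspect]} {label} [SEP] {text}"
```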

Explicit Training
Given our hypothesis that these language models will be able to generalize to unseen labels as a result of implicitly learning the task at hand, we explore the idea of explicitly training this generalization in a supervised manner. Instead of adding a conditional aspect token, we add an additional pre-training step in which the model is trained on aspect detection. This step acts as an initialization process whereby the model representations are tuned at the aspect level first. Once this step is completed, the model is then fine-tuned on its respective zero-shot classification objective. This process is shown in part (c) of Figure 3. For a given text x, this explicit training step trains the model to predict the aspect to which x belongs.
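The two-stage schedule can be sketched abstractly; `train_step` stands in for one gradient update, so this illustrates only the ordering of the stages, not the paper's implementation:

```python
def explicit_pretrain_then_finetune(params, aspect_data, task_data, train_step):
    """Stage 1: tune on aspect detection (text -> aspect) to initialize
    representations at the aspect level. Stage 2: continue from those
    weights on the zero-shot classification objective."""
    for text, aspect in aspect_data:      # explicit pre-training stage
        params = train_step(params, text, aspect)
    for text, label in task_data:         # zero-shot fine-tuning stage
        params = train_step(params, text, label)
    return params
```

The key design choice is that stage 2 starts from stage-1 weights rather than from the raw PLM, so aspect-level structure is baked in before any label-specific training.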

UTCD: Universal Text Classification Dataset
In order to test the zero-shot generalization of these NLP models, we introduce UTCD. UTCD is a compilation of 18 classification datasets spanning 3 main aspects of Sentiment, Intent/Dialogue, and Topic classification. A breakdown of each dataset is provided in appendix A. UTCD focuses on the task of zero-shot text classification where the candidate labels are descriptive of the text being classified.
To make NLP models more broadly useful, zero-shot techniques need to be capable of label, domain & aspect transfer. As such, in the construction of UTCD we enforce the following principles:
Textual labels In UTCD, we mandate the use of textual labels. While numerical label values are often used in classification tasks, descriptive textual labels such as those present in the datasets across UTCD enable the development of techniques that can leverage the class name, which is instrumental in providing zero-shot support. As such, for each of the compiled datasets, labels are standardized such that they are descriptive of the text in natural language.
Diverse domains and sequence lengths In addition to broad coverage of aspects, UTCD compiles diverse data across several domains such as Banking, Finance, and Legal, each comprising sequences of varied length (long and short). The datasets are listed in Table 1.
As described in section 3, we define an aspect as the sub-task type to which a given set of datasets can belong. We simulate the zero-shot learning case by splitting UTCD into in-domain data, which a given model would be trained on, and out-of-domain data, with novel classes unseen during training. Additionally, to prevent data imbalance across aspects, we sub-sample the in-domain datasets such that the total number of unique texts in each aspect is the same, while maintaining the class label distribution of each dataset. Class imbalance is known to degrade performance in deep learning models (Buda et al., 2018; Ochal et al., 2021). We observe a similar trend, where aspect normalization results in performance improvement.
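The aspect-normalization step can be sketched as a stratified subsample that keeps each dataset's class proportions. The function name and rounding rule here are ours; the paper does not specify the exact procedure:

```python
import random
from collections import defaultdict

def stratified_subsample(examples, target_n, seed=0):
    """Subsample (text, label) pairs down to roughly `target_n` while
    preserving each label's share of the data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    frac = target_n / len(examples)
    sampled = []
    for label, items in by_label.items():
        # Keep each label's proportion; never drop a label entirely.
        k = min(len(items), max(1, round(len(items) * frac)))
        sampled.extend(rng.sample(items, k))
    return sampled
```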

Experimental Setup
Model Architectures For binary classification, we use BERT BASE with sentence pair classification as in Devlin et al. (2018). For dual encoding classification, we use Sentence-BERT (Reimers and Gurevych, 2019) with BERT BASE as the base encoder, mean pooling, and cosine similarity as the distance metric. For generative classification, we use the 345M-parameter GPT-2 (Radford et al., 2019) as the language model and the input representation described in Puri and Catanzaro (2019). These models are denoted Binary BERT, Bi-Encoder, and GPT-2 respectively.
Training We train all models with AdamW (Loshchilov and Hutter, 2019) and a weight decay of 0.01 on all in-domain data for 3 epochs, for both the pre-training and fine-tuning stages. For explicit pre-training, we use a learning rate of 2e-5, a batch size of 16, and linear learning rate warmup over the first 10% of steps with a cosine schedule. For binary and dual encoding classification, we use a learning rate of 2e-5, a batch size of 16, with 10% warmup and a linear schedule. For generative classification fine-tuning, we use a learning rate of 4e-5, a batch size of 128, with 1% warmup and a cosine schedule as reported in Puri and Catanzaro (2019). We pre-process data and train all models with different random seeds over multiple runs.

Results & Discussion
In this section we present and analyze the results of our experiments, detailing our insights and discussing the implications of each of our techniques.
Evaluation Task We report accuracy on the test set of all in-domain and out-of-domain datasets.
In multi-label cases where there is more than one valid label, a prediction is considered correct if the model predicts any one of the correct labels. For generative classification, we observe instances in which GPT-2 does not generate one of the label options, a known problem for PLM generation (Radford and Narasimhan, 2018; Pascual et al., 2021).
In such cases, we consider the label option most similar to the generated answer as the prediction, by mapping the generated output and the valid classes to an embedding space. For this encoding, we use the pre-trained MPNet model (Song et al., 2020) with a mean-pooling encoder from Sentence-BERT (Reimers and Gurevych, 2019) to embed the labels, and cosine similarity as the distance metric. This ensures the consistency of GPT-2's output with the other zero-shot formalizations.
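This fallback mapping can be sketched as a nearest-neighbor search in embedding space; here a toy bag-of-words `embed` replaces the MPNet encoder and is purely illustrative:

```python
import math

def _cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_label(generated, labels, embed):
    """Map free-form generated text to the closest valid class label by
    embedding both sides and comparing with cosine similarity."""
    g = embed(generated)
    sims = [_cos(g, embed(lab)) for lab in labels]
    return labels[max(range(len(labels)), key=lambda i: sims[i])]

def toy_embed(s):
    """Stand-in for a sentence encoder: bag-of-words over a tiny vocab.
    The epsilon avoids zero-norm vectors."""
    vocab = ["good", "bad", "news", "sports"]
    words = s.lower().split()
    return [sum(w == v for w in words) or 1e-9 for v in vocab]
```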

Upper-bound & Zero-shot Baselines
To gauge the ability of our models to generalize to unseen data, we establish our upper bound as the performance of a fully supervised model on the target data. Specifically, we fine-tune two variations of BERT BASE for sequence classification, which we denote as "individual" and "full". For individual, we fine-tune a dedicated classification model for each dataset in UTCD. For full, we fine-tune a single model on all datasets. Additionally, we compare the zero-shot performance of our models to the popular LLM GPT-3 (Brown et al., 2020) and to BART MNLI (Yin et al., 2019) from the Huggingface Hub.2

Out-of-domain Performance
In Table 2, we report results on the out-of-domain test set for UTCD. To evaluate the ability of our zero-shot models to adapt to unseen data, we evaluate our fine-tuned models from Table 3 on the out-of-domain test set without training on any out-of-domain data. Across the zero-shot formalizations, we observe that our explicit Binary BERT achieves the best performance, with a 2% increase over its vanilla counterpart, showing the power of the explicit pre-training strategy for the binary classification formalization.
When compared to the "full" supervised out-of-domain model, despite having not been trained on any data from the target dataset, our models are able to generalize well across the aspects of sentiment and intent. Specifically, across all formalizations, our models outperform the supervised model on the Finance Phrasebank dataset. We observe that the supervised model's drop on this dataset is due to conflicting domain data: UTCD's out-of-domain set consists of similar financial datasets in the other aspects of intent and topic. Given that examples from the Finance Phrasebank dataset are general in nature, without seeing the label it is difficult for the sequence classifier to understand the task at hand, causing it to classify to conflicting labels from similar datasets. This showcases the need to include aspect-specific knowledge.
Lastly, when inspecting the performance of vanilla fine-tuning compared to implicit and explicit training, we outperform vanilla fine-tuning on generalizing to unseen data on 6, 6, and 8 of the 9 datasets in out-of-domain UTCD for the Binary BERT, Bi-Encoder, and GPT-2 models respectively. In particular, for explicit training on Binary BERT, we achieve a massive improvement in zero-shot generalization (as much as +16% for the topic aspect, +9% on average). Additionally, in comparison to the massive zero-shot baselines of BART and GPT-3, our models outperform on 7 and 8 of the 9 datasets respectively.

In-domain Performance
In Table 3, we report results on the in-domain test set for UTCD. For in-domain, we conduct implicit & explicit training across each zero-shot formalization. We observe that, when compared with the "full" supervised model, our zero-shot models are more performant while maintaining the flexibility of facilitating zero-shot classification. When compared with the "individual" variation, as our zero-shot models are trained jointly across different datasets, we achieve better performance than the single supervised model on datasets such as SGD, showing the power of knowledge transfer from other intent datasets such as Clinc-150 & SLURP.
Compared to vanilla fine-tuning without implicit or explicit training, we observe that across zero-shot formalizations, injecting task specification through implicit and explicit pre-training preserves performance on in-domain data. This shows that while achieving better zero-shot transfer ability, our models do not suffer performance loss on data already seen during training.

Importance of Label token overlap
In addition to the need for aspect-specific knowledge, we also observe a high correlation between zero-shot generalization results and the overlap of label tokens seen during training with those evaluated on in the out-of-domain test set. Figure 4 shows the pair-wise overlap of label tokens across the in-domain and out-of-domain datasets. When inspected across aspects, we see that our models achieve the best out-of-domain performance on datasets whose label tokens overlap most with those seen during training.
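A rough proxy for the pair-wise overlap in Figure 4 can be computed over label token sets. This is our own simplified measure for illustration, not necessarily the exact metric plotted in the paper:

```python
def label_token_overlap(in_domain_labels, out_domain_labels):
    """Percentage of out-of-domain label tokens that already appear in
    the in-domain label vocabulary (0 = none shared, 100 = all shared)."""
    seen = {t for lab in in_domain_labels for t in lab.lower().split()}
    new = {t for lab in out_domain_labels for t in lab.lower().split()}
    if not new:
        return 0.0
    return 100.0 * len(new & seen) / len(new)
```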

Related Work
Zero-shot text classification is the task of classifying text into novel categories unseen during training. Early zero-shot classification studies frame the problem as binary classification on whether a given label describes the text (Pushp and Srivastava, 2017; Yin et al., 2019). With the advancement of PLMs, subsequent works (Yin et al., 2019; Puri and Catanzaro, 2019) rely on transformer architectures to learn representations from the descriptive labels passed in. In particular, Puri and Catanzaro (2019) fine-tune an autoregressive language model to generate titles based on a prompt template containing articles and a list of title options. Though the model is trained on a great variety of title options, the approach limits learning to topic classification only, as the authors only analyze performance on topic datasets, unlike our approach, which considers a wide array of aspects, each requiring focus on different sections of a given text. Yin et al. (2019) similarly categorize zero-shot text classification by aspects and implicitly introduce aspects during training with a dedicated template for each aspect. They further propose framing the classification of a <text, label> pair as a logical entailment problem. However, the authors analyze a less challenging zero-shot case where a model is trained on a subset of <text, label> pairs and evaluated on the remaining text with unseen labels in the same domain. Additionally, the authors introduce WordNet definitions of the labels, as the labels are all single words. This process requires manual intervention and is not applicable to multi-word label sequences common in intent classification, such as "Check Balance". Our work evaluates a more diverse set of datasets for each aspect and a more challenging zero-shot setting.

Conclusion
In this paper, we investigate the task of zero-shot text classification with the aim of improving the ability of PLMs to generalize to both seen and unseen data across domains without the need for additional training. We introduce two new simple yet effective pre-training strategies, Implicit training & Explicit pre-training, which specifically inject aspect-level understanding into the model at train time. To evaluate this, we release UTCD, a new benchmark dataset for evaluating text classification in zero-shot settings. Experimental results on UTCD show that our approach achieves improved zero-shot generalization on a suite of challenging datasets in UTCD and across many zero-shot formalizations.

Limitations
While our approach is shown to be effective in improving the zero-shot adaptation ability of these PLMs, the scope of this work has only been extended to English and has not been tested on other languages. Another limitation of this work is the scope of the aspect. Aspect is defined across 3 main categories of intent, sentiment, and topic in this work. However, given the massive space of text-label interpretations, our aspect range can be refined and expanded even further, lending itself to more analysis of the stability of implicit & explicit training as the number of aspects grows. We do not investigate this scenario in this work.
UTCD consists of 6M/800K train/test examples.
For sentiment, we have the datasets Go Emotion (Demszky et al., 2020), TweetEval (Barbieri et al., 2020), Emotion (Saravia et al., 2018), Amazon Polarity (Zhang et al., 2015), Finance Phrasebank (Malo et al., 2014) and Yelp (Zhang et al., 2015). The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The TweetEval dataset consists of seven heterogeneous tasks on Twitter, all framed as multi-class tweet classification. The tasks include irony, hate, offensive, stance, emoji, emotion, and sentiment. We used the sentiment portion of this dataset for UTCD. Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. The Amazon Polarity dataset consists of reviews from Amazon. The data spans a period of 18 years, including 35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The Finance Phrasebank dataset consists of 4840 sentences from English-language financial news categorized by sentiment. The Yelp dataset consists of over 600k reviews for the task of sentiment classification.
For the intent/dialogue aspect, we have the following datasets: Schema Guided Dialogue (Rastogi et al., 2020) consists of annotated multi-domain, task-oriented conversations between a human and a virtual assistant. Clinc-150 (Larson et al., 2019) is an intent classification dataset consisting of 150 in-domain intent classes. SLURP (Bastianelli et al., 2020) is an English spoken language understanding (SLU) dialogue dataset spanning 18 domains. Banking77 (Casanueva et al., 2020b) is an intent classification dataset for the banking domain. It comprises 13,083 customer service queries labeled with 77 intents. Snips is an NLU dataset of over 16,000 crowdsourced queries distributed among 7 user intents. NLU Evaluation (Xingkun Liu and Rieser, 2019) is an NLU dataset from the conversational domain annotated with corresponding intents and dialogue scenarios.
Lastly, for the topic aspect, we have the following datasets: AG News (Zhang et al., 2015) is a topic classification dataset extracted from the AG News article corpus. It consists of 4 classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The Yahoo Answers dataset (Zhang et al., 2015) contains 4,483,032 questions and their answers across 10 categories. Each class contains 140,000 training samples and 5,000 testing samples. The DBpedia dataset (Auer et al., 2007) is a topic classification dataset constructed by picking 14 non-overlapping classes from DBpedia 2014. Multi Eurlex (Chalkidis et al., 2021) is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages, annotated with multiple labels from the EUROVOC taxonomy. Big Patent (Sharma et al., 2019) is a topic classification dataset for the legal domain consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. The Consumer Finance dataset (Bureau, 2012) is a collection of complaints about consumer financial products and services sent to companies for response.

Figure 1: Zero-shot Text Classification Problem: In real-world applications, the model needs to adapt to unseen labels. For a given aspect and domain, the interpretation of a given text-label pair can vary greatly.

Figure 2: Zero-shot Text Classification Formalizations: Part (a) illustrates the binary classification formalization described in section 2, where concatenated <text, label> pairs are passed as input to the model. Part (b) illustrates dual encoding, where text-label pairs are encoded separately and scored via a distance metric. Part (c) illustrates generative classification, where the model generates the desired label based on a natural language instruction template.

Figure 3: Zero-shot Text Classification Training Strategies. Part (a) shows standard model training, where a text and the set of label options are passed to the model. Part (b) illustrates implicit training, where the aspect is additionally passed as input. Part (c) shows injecting aspect knowledge into the model explicitly through gradient updates, to initialize subsequent training.

Figure 4: UTCD Out-of-domain Dataset Label Pair-wise Overlap with In-domain Datasets. 0 is no overlap, 100 is exactly the same label set. From sentiment to intent to topic, label overlap decreases in general.

Table 2: Aspect-Normalized out-of-domain accuracy. *Supervised upper bound, not a zero-shot framework. †In case none of the given labels are generated at inference, the generated text is embedded and compared with label embeddings. ‡Out-of-the-box zero-shot classifier; the most popular and widely downloaded zero-shot model on the Huggingface Hub.