A Simple and Effective Framework for Strict Zero-Shot Hierarchical Classification

In recent years, large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero- or few-shot settings. However, these benchmarks often do not adequately address the challenges posed in the real world, such as hierarchical classification. To address this challenge, we propose refactoring conventional tasks on hierarchical datasets into a more indicative long-tail prediction task. We observe that LLMs are more prone to failure in these cases. To address these limitations, we propose using entailment-contradiction prediction in conjunction with LLMs, which enables strong performance in a strict zero-shot setting. Importantly, our method does not require any parameter updates, a resource-intensive process, and achieves strong performance across multiple datasets.


Introduction
Large language models (LLMs) with parameters in the order of billions (Brown et al., 2020) have gained significant attention in recent years due to their strong performance on a wide range of natural language processing tasks. These models have achieved impressive results on benchmarks (Chowdhery et al., 2022), particularly in zero- or few-shot settings, where they are able to generalize to new tasks and languages with little to no training data. It is, however, difficult to tune the parameters of these large-scale models due to resource limitations. Additionally, the focus on benchmarks has led to the neglect of real-world challenges, such as hierarchical classification. As a result, the long-tail problem (Samuel et al., 2021) has been overlooked. This problem arises in many natural language tasks when a vast number of rare classes occur alongside a few frequent classes. In many industrial real-world applications, a strong-performing method for hierarchical classification can be of direct utility. New product categories are constantly emerging on e-commerce platforms, and existing categories may not be very intuitive for customers. For example, upon browsing a category such as night creams, we may be unable to find a product placed in a sibling-node category of creams. This is further highlighted by platforms in which a systematic structure is not created for users; parent nodes may appear in place of child nodes, and vice versa (Asghar, 2016). Manually recategorizing products can be a costly redesign endeavour. To tackle this problem, we suggest refactoring traditional hierarchical flat-labeled prediction tasks (Liu et al., 2021) into a more indicative long-tail prediction task. This involves structuring the classification task to closely reflect the real-world long-tail distribution of classes. In doing so, we can leverage LLMs for long-tail prediction tasks in a strict zero-shot classification setting.
Through a series of experiments, we show that our proposed method significantly improves performance over the baseline on several datasets, and holds promise for addressing the long-tail problem in real-world applications. The contributions of this work can be summarized as follows:
• We refactor real-world hierarchical taxonomy datasets into long-tailed problems. In doing so, we create a strong testbed to evaluate "strict zero-shot classification" with LLMs.
• We explore utilizing LLMs to enhance the capabilities of entailment-contradiction predictors for long-tail classification. This enables strong model inference without resource-intensive parameter updates.
• We show, through quantitative empirical evidence, that our proposed method is able to overcome the limitations of stand-alone large language models. Our method obtains strong performance on long-tail classification tasks.

Background and Related Work
Strict Zero-Shot Classification Previous works (Liu et al., 2021; Yin et al., 2019) have explored zero-shot classification extensively under two settings: (i) zero-shot, where testing labels are unseen, i.e., have no overlap with the training labels, and (ii) generalized zero-shot, where testing labels are partially unseen. In both cases, the model is trained on data from the same distribution as the test data. In our proposed strict zero-shot setting, the model is only trained to learn the entailment relationship from natural language inference (NLI) corpora (Williams et al., 2018). The training data for this model has no overlap with the distribution or semantics of the inference set. Additionally, previous works utilizing NLI have either not examined the utility of LLMs (Ye et al., 2020; Gera et al., 2022), or transfer the capabilities of LLMs to smaller models but fail to use them in a strict zero-shot setting for long-tail problems, demonstrating their utility only on benchmark tasks (Tam et al., 2021; Schick and Schütze, 2021). Works exploring LLMs have also limited their study to using them independently, without exploring entailment-contradiction prediction (Wei et al., 2022).
Long-Tail Problem Existing literature in natural language processing has focused on scenarios involving limited data availability, such as few-shot or low-resource settings, but has failed to adequately address the unique challenges presented by long-tail problems. These problems arise when a small number of classes possess a large number of samples, while a large number of classes contain very few samples. Previous works have not delved into the specific use of LLMs or entailment predictors for this setting.

Hierarchical Classification
Many real-world problems contain taxonomy data structured in a hierarchical setting. As shown in Figure 2, most previous works treat this data as a flat-label task (Kowsari et al., 2017; Zhou et al., 2020). It is, however, non-trivial to create the clean training data for taxonomies that these methods rely on. This setting also combines parent and child nodes into a multi-label task, thereby increasing the complexity of the problem, as siblings amongst leaf nodes are more diverse than parent nodes. Additionally, previous works do not make use of the natural entailment relations in hierarchies. Other works sidestep this problem by opting to utilize flat labels to produce graph representations (Wang et al., 2022a,b; Jiang et al., 2022; Chen et al., 2021). These graph representations may have limited value independently, although they may be used to assist text classification by providing an organized label space. Such works only introduce hierarchies to bring order to the label space, but overlook the original task of hierarchical taxonomy classification (Kowsari et al., 2017). All previous works require difficult-to-obtain fine-tuning data to produce strong signals.

Figure 3: Web of Science (WOS) and Amazon Beauty datasets, refactored to a long-tail distribution. Maximum tree depth is shown for Amazon Beauty, which varies from 3-5. Leaf nodes are used in our method regardless of depth.

In our work, we refactor this data into a leaf-node label prediction task with the help of entailment-contradiction relations and LLMs. In doing so, we enable hierarchical taxonomy prediction without utilizing training data for the downstream task.

Methodology
In this paper, we investigate the limitations of LLMs in three overlooked settings: (i) the model is not provided with sufficient examples due to the input text length, (ii) the label space includes tokens largely unobserved in the model's pretrained vocabulary, and (iii) there are a large number of distractors in the label space (Kojima et al., 2022; Min et al., 2022; Razeghi et al., 2022). These scenarios are common in real-world tasks, but are often overlooked in the development and evaluation of LLMs. To address these challenges, we propose the use of entailment-contradiction prediction (Yin et al., 2019), the task of determining whether a premise logically entails or contradicts a hypothesis. Through our method, we are able to combine the strong reasoning of Yin et al. (2019) with the retrieval abilities of LLMs (Wang et al., 2020) to improve overall performance in a strict zero-shot setting, where the model must classify samples from a new task without any fine-tuning or additional training examples from the same distribution as the inference dataset. Importantly, our proposed combined method does not require parameter updates to the LLM, a resource-intensive process that is not always feasible with increasingly large model sizes (Chowdhery et al., 2022). Our simple framework is shown in Figure 1. Considering the label space C = {C_1, C_2, ..., C_n} as the set of classes for any given dataset, and a text input X, we can utilize the entailment predictor E to make a contradiction or entailment prediction for each label. This is done by using X as the premise, and "This text is related to C_i." for all C_i ∈ C as the hypothesis (Yin et al., 2019). In our work, the premise may be modified to include the prompt template. The prediction E(X) lacks any external knowledge and is restricted to the label space, which may result in poor performance. E(X) can, however, provide us with an implicit classification of the contradiction relation for sibling nodes.
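The entailment step E(X) can be sketched as follows. Here, `score_entailment` is a placeholder for a real NLI model such as BART-Large-MNLI; the word-overlap stub and the example labels are assumptions for illustration only:

```python
import re

def entailment_predict(premise, labels, score_entailment):
    """Score the hypothesis 'This text is related to <label>.' for every
    label and return the labels ranked by entailment score, i.e. E(X)."""
    scored = [
        (label, score_entailment(premise, f"This text is related to {label}."))
        for label in labels
    ]
    return [label for label, _ in sorted(scored, key=lambda p: p[1], reverse=True)]

def toy_score(premise, hypothesis):
    """Word-overlap stub; a real system would query an NLI model here."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(words(premise) & words(hypothesis))

ranked = entailment_predict(
    "A study of polymer chemistry and reaction kinetics.",
    ["Chemistry", "Medicine", "Computer Science"],
    toy_score,
)
print(ranked[0])  # Chemistry
```

Because the prediction is restricted to the hypotheses built from C, low-scoring labels implicitly carry the contradiction signal for sibling nodes.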
In our work, we use E(X) as an initializer for LLMs. We also regard it as a baseline, as it shows strong performance independently. An LLM, L, on the other hand, operates in an open space, with the aforementioned shortcomings for classification tasks. For our purposes, we can regard it as a noisy knowledge graph (Wang et al., 2020), which may be utilized to predict ancestors or descendants of the target class. We denote the prediction made by the LLM as L(X). It is important to note that L(X) may or may not belong to C. We try several prompts for this purpose, shown in Appendix A.
By combining these two components, we can create a template which utilizes the entailment relation explicitly, and the contradiction relation implicitly, by constructing L(E(X)) to disseminate combined information into an entailment predictor for classification. The template we use is task-dependent, and is generally robust given an understanding of the domain. For Web of Science we use: "Here is some text that entails E(X): X. What area is this text related to?". For Amazon Beauty, we use: "Here is a review that entails E(X): X. What product category is this review related to?". In this setting, our method still meets a barrier due to the limitations of LLMs. By constructing the composite function E(L(E(X))), we are able to leverage our LLM to produce tokens which may steer the entailment predictor to correct its prediction. The template used for this composite function is "Here is some text that entails L(E(X)): X." across all datasets.
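A minimal sketch of one full E(L(E(X))) cycle, assuming stub components: `stub_entail` and `stub_llm` stand in for the actual BART-Large-MNLI and T0pp models, and the label space shown is hypothetical:

```python
def compose(x, labels, entail, llm,
            question="What area is this text related to?"):
    """One cycle of the framework: E(X) -> L(E(X)) -> E(L(E(X))).

    entail(premise, labels) returns the top label from the closed label
    space; llm(prompt) returns free text that need not belong to the
    label space."""
    e_x = entail(x, labels)                                            # E(X)
    hint = llm(f"Here is some text that entails {e_x}: {x}. {question}")  # L(E(X))
    return entail(f"Here is some text that entails {hint}: {x}.", labels)  # E(L(E(X)))

# Stub components (illustration only; not the paper's models).
def stub_entail(premise, labels):
    # Pick the label mentioned most often in the premise.
    return max(labels, key=lambda c: premise.lower().count(c.lower()))

def stub_llm(prompt):
    return "organic chemistry"  # free-text answer outside the label space

labels = ["Chemistry", "Medicine", "Computer Science"]
pred = compose("A study of polymer reactions.", labels, stub_entail, stub_llm)
print(pred)  # Chemistry
```

Note how the LLM's free-text hint ("organic chemistry") is folded back into the final entailment premise, steering the closed-label predictor without any parameter updates.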
General Form: Although our results are reported by combining the advantages of L and E up to the composite function E(L(E(X))), this is an iterative composition function and can be extended to E(L(E(L(...E(X))))). Our observations show that this extended setting yields comparable or only marginally better results on our datasets. However, it may prove beneficial for other tasks; we urge other researchers to explore this direction in future work.
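The iterative general form can be sketched as a loop, again with stub components standing in for the entailment predictor and LLM (both are assumptions for illustration only):

```python
def iterate_compose(x, labels, entail, llm, cycles=2,
                    question="What area is this text related to?"):
    """General form E(L(E(L(...E(X))))): repeat the entail -> LLM ->
    entail cycle a fixed number of times."""
    pred = entail(x, labels)  # initial E(X)
    for _ in range(cycles):
        hint = llm(f"Here is some text that entails {pred}: {x}. {question}")
        pred = entail(f"Here is some text that entails {hint}: {x}.", labels)
    return pred

# Stubs standing in for the entailment predictor and the LLM.
entail = lambda premise, labels: max(labels, key=lambda c: premise.lower().count(c.lower()))
llm = lambda prompt: "skin care products"
labels = ["Skin Care", "Hair Care", "Fragrance"]
print(iterate_compose("A review of a night cream.", labels, entail, llm))  # Skin Care
```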

Dataset and Experimental Settings
We refactor the widely used Web of Science (WOS) (Kowsari et al., 2017) and Amazon Beauty (McAuley et al., 2015) datasets to follow a class-wise long-tail distribution, as shown in Figure 3. Additionally, we create two variations of the Amazon Beauty dataset: the first has the same tree depth as WOS, with both datasets containing 3000 samples; the second includes all classes at their maximum tree depth and contains 5000 samples. We select these datasets as they challenge the shortcomings of LLMs. Providing multiple abstracts as input text in the WOS dataset surpasses the maximum input token length of most transformer-based models. This makes it difficult for models to learn the input distribution, a requirement for showing strong in-context performance (Min et al., 2022). Next, many tokens in the label space of both the WOS and Amazon Beauty datasets rarely occur in pretraining corpora; details are provided in Appendix B. Additionally, both datasets contain a large number of distractors, or incorrect classes, in the label space. Further details are provided in Appendix C. All experiments are performed on a single NVIDIA Titan RTX GPU. We use BART-Large-MNLI with 407M parameters as our baseline, as it outperforms other architectures trained on MNLI for zero-shot classification. For our LLM, we opt to use T0pp (Sanh et al., 2022) with 11B parameters, as previous works show that it matches or exceeds the performance of other LLMs such as GPT-3 (Brown et al., 2020) on benchmarks.
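The refactoring step can be sketched as follows, using a toy taxonomy (the paths shown are assumptions for illustration; the actual WOS and Amazon Beauty hierarchies differ): each sample keeps only the leaf node of its taxonomy path as the target class, and the resulting class frequencies expose the long-tail shape.

```python
from collections import Counter

def to_longtail(samples):
    """Refactor hierarchical (path, text) samples into a leaf-label task:
    keep only the deepest node of each path as the target class, then
    inspect the resulting class frequencies for a long-tail shape."""
    leaf_labeled = [(path[-1], text) for path, text in samples]
    counts = Counter(label for label, _ in leaf_labeled)
    # Sort classes by frequency: head classes first, tail classes last.
    return leaf_labeled, counts.most_common()

# Toy taxonomy paths (illustration only).
samples = [
    (["Medical", "Polycythemia Vera"], "abstract 1"),
    (["CS", "Machine Learning"], "abstract 2"),
    (["CS", "Machine Learning"], "abstract 3"),
    (["CS", "Machine Learning"], "abstract 4"),
]
data, distribution = to_longtail(samples)
print(distribution)  # [('Machine Learning', 3), ('Polycythemia Vera', 1)]
```

Because leaf nodes are used regardless of depth, the same refactoring applies to the variable-depth Amazon Beauty trees.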

Results and Discussion
Results of our method are shown in Table 1. LLMs, due to their limitations, perform poorly as stand-alone models for long-tail classification. These results can be improved by priming the model with an entailment predictor through the usage of a prompt. The baseline shows strong performance independent of the LLM, as it operates on a closed label space. The capabilities of the baseline can be further enhanced by explicitly priming it with an entailment relation through an LLM. Rows in which T0pp is initialized, or primed, with E are indicated with "Primed". Priming the model shows improvements across all datasets for macro F1. For accuracy, priming the model shows benefit in two out of three datasets. In Figure 4, we show the results of Top-5 predictions for the WOS dataset.
All results are aggregated in Table 1. It is important to highlight that prompt variation led to stable results for our LLM. The variance upon utilizing BART-MNLI is negligible across prompts. For our method, the best results are observed up to Top-4 predictions on both accuracy and macro F1, when the entailment prompt is enhanced with a greater number of tokens corresponding to the output of L(E(X)). The variation between our method and the baseline is much greater for Top-1 predictions, while Top-5 prediction variance is negligible. Detailed results for both depth settings of Amazon Beauty are shown in Appendix C.
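The Top-k accuracy reported throughout can be computed as follows; the ranked predictions and gold labels below are toy values for illustration:

```python
def top_k_accuracy(ranked_preds, gold, k):
    """Fraction of samples whose gold label appears among the top-k
    ranked predictions (the Top-1/Top-3/Top-5 numbers in Table 1)."""
    hits = sum(g in preds[:k] for preds, g in zip(ranked_preds, gold))
    return hits / len(gold)

# Toy ranked label lists for three samples, all with gold label "A".
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
gold = ["A", "A", "A"]
print(top_k_accuracy(ranked, gold, 1))  # 0.333...
print(top_k_accuracy(ranked, gold, 3))  # 1.0
```

This is why Top-1 variation between methods can be large while Top-5 variance is negligible: as k grows, more of the ranked list is credited.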

Conclusion
In this work, we utilize an LLM in the form of a noisy knowledge graph to enhance the capabilities of an entailment predictor. In doing so, we achieve strong performance in a strict zero-shot setting on several hierarchical prediction tasks. We also show the necessity of refactoring existing hierarchical tasks into long-tail problems that may be more representative of the underlying task itself. This setting also highlights the utility of our method in practical industry applications.

Limitations
In this work, we utilize the contradiction relation implicitly. The authors recognize that explicitly including it in a prompt template leads to worse performance due to the injection of noise. Controlled template generation based on model confidence is unexplored in this work and appears to be a promising direction. Additionally, we recognize the emergence of parameter-efficient methods for training models, which are unexplored in this work and may have utility. These methods are complementary, and may benefit model performance when used in conjunction with training paradigms such as contrastive learning to support better representations through explicit utilization of the contradiction relation. In this work, we limit our study to draw attention to the importance of strict zero-shot classification settings with the emergence of LLMs.
Our study can easily be extended to operate recursively on large language models and entailment predictors. As we observe limited performance benefits in doing so, we conduct our study to show improvements after one complete cycle, given by E(L(E(X))) in Section 3.

Ethics Statement
In this work, we propose a framework which allows for the usage of entailment-contradiction predictors in conjunction with large language models. In doing so, this framework operates in a strict zero-shot setting. While it is possible to tune prompts to select optimal variants through hard/soft prompt tuning strategies, this would require additional computational resources for LLMs with billions of parameters. Our study investigates the usage of LLMs given an understanding of the domain they tend to be used for (e.g., given an understanding that Amazon Beauty contains reviews, a prompt is constructed accordingly). Further explanation of prompt templates is contained in Appendix A. Due to the lack of parameter tuning in this work, large language models are largely dependent on their pre-training data. Although this can be controlled to some degree by introducing an entailment predictor with a fixed label space, the underlying model does not explicitly contain supervision signals without further training. The framework proposed for inference in this work must hence be used cautiously for sensitive areas and topics.

A Prompt Templates
In our work, we try various prompts for WOS and Amazon Beauty, both to initialize the LLM and for the entailment predictor. These prompts are shown in Table 2. Initializing prompts for L may show some variance in performance when utilized independently. The prompts used for obtaining L(E(X)) are generally robust given an understanding of the domain, and variations have only a marginal impact on the outcome. Prompts used for E(L(E(X))) have an insignificant impact on the outcome.

B Statistics
Details of the distribution for the Web of Science dataset, with the head and tail of the distribution class names and their respective value counts, are provided in Table 3. We also provide details of the class-wise distribution for the Amazon Beauty (Depth=2) and Amazon Beauty (Depth=3,4,5) datasets in Table 4 and Table 5, respectively. Towards the tail end of the distribution, we observe several tokens which may appear infrequently in most pretraining corpora, such as "Polycythemia Vera" in the WOS dataset. Updating the parameters of a model on data which is heavily skewed towards the tail of the distribution, in the presence of frequently occurring labels, can be problematic for language models. Our proposed method is one solution to this challenging task.

C Detailed Results
We provide detailed results for Top-1, Top-3, and Top-5 accuracies and macro F1 scores in this section. Results on the Web of Science dataset are shown in Table 6. We observe that the Top-1 accuracy of all of our methods is significantly higher than that of the baseline, BART-MNLI. The same trend is seen for macro F1 scores. In predicting Top-3 labels, only our Primed+ method shows improvement in accuracy over the baseline. For macro F1, our method in the Top-3 category shows slight improvement over the baseline. For Top-5 predictions on the WOS dataset, our method performs marginally below the baseline. Results for Amazon Beauty (Depth=2) are shown in Table 7. There is a large improvement in accuracy using our method on this dataset for Top-1, Top-3, and Top-5. For macro F1, our method performs marginally worse than the baseline for Top-1 predictions. Our method strongly outperforms the baseline by a large margin for Top-3 and Top-5 predictions.

WOS
What field is this passage related to? + X
What area is this text related to? + X
X + What area is this text related to?
What area is this text related to? + X

Amazon Beauty
Here is a review: + X + What product category is this review related to?
X + What product category is this text related to?

The performance of using int-8 quantization is robust and matches that of bf-16/fp-32 for inference. These settings also provide us with stable results across prompts.
Previous works have performed parameter updates (Gera et al., 2022; Holtzman et al., 2021) on models to tackle the challenge of many distractors in the label space. This may be practically infeasible due to the compute requirements of LLMs.
The diversity between category labels is an important factor which we observe contributes to the improvement in performance. Tables 3, 4, and 5 contain statistics for the labels used. We observed a significant drop in macro F1, shown in Table 1, for the Amazon Beauty dataset (Tree Depth=2) in contrast to WOS for the same models, due to the lack of diversity in several class names (e.g., "Bath" and "Bathing Accessories"). Similar trends were observed in Amazon Beauty (Tree Depth=3,4,5) for "Eau de Toilette" and "Eau de Parfum", both of which are perfumes.