In-Context Learning for Text Classification with Many Labels

In-context learning (ICL) using large language models for tasks with many labels is challenging due to the limited context window, which makes it difficult to fit a sufficient number of examples in the prompt. In this paper, we use a pre-trained dense retrieval model to bypass this limitation, giving the model only a partial view of the full label space for each inference call. Testing with recent open-source LLMs (OPT, LLaMA), we set new state-of-the-art performance in few-shot settings for three common intent classification datasets, with no fine-tuning. We also surpass fine-tuned performance on fine-grained sentiment classification in certain cases. We analyze performance across different numbers of in-context examples and model scales, showing that larger models are necessary to effectively make use of larger context lengths for ICL. By running several ablations, we analyze the model's use of: a) the similarity of the in-context examples to the current input, b) the semantic content of the class names, and c) the correct correspondence between examples and labels. We demonstrate that all three are needed to varying degrees depending on the domain, contrary to certain recent works.


Introduction
In-context learning (ICL) using large language models (LLMs) has recently exploded in popularity. Models pre-trained on massive amounts of textual data are able to reach reasonable performance on a wide variety of tasks with only a few examples of input and output for a given task provided in the model's input prompt in natural language (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2023). In this work, we study whether ICL can handle challenging classification tasks with many possible labels, by augmenting the LM with a secondary pre-trained retrieval model.
The main problem with applying ICL to tasks involving classification with many labels is the limited context window these models have. Ordinarily with ICL, at minimum one example from each class is provided in-context to allow the model to make a choice between all the labels of the task. Because of this limitation, ICL has not been directly applied to these sorts of problems. In this work we relax this requirement, allowing the model to see only a subset of the most relevant labels for the datapoint we are performing inference on. By testing on intent classification (upwards of 50 classes) and fine-grained sentiment analysis (upwards of 25 classes), we demonstrate that the resulting performance with this method can reach SoTA. By coupling the LLM with an external pre-trained dense retriever model (Reimers and Gurevych, 2019a; Karpukhin et al., 2020), we can dynamically retrieve a set of examples to provide to the LM in-context that reflects only the labels most relevant to the current example. Most existing work on augmenting LMs with retrieval models (Ram et al., 2023; Shi et al., 2023) focuses on tuning the retriever and/or the LM. We demonstrate that even without tuning either, when the pre-trained models are strong enough, we can still achieve SoTA across various tasks using ICL.
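The core of this pipeline can be summarized in a few lines of code. Below is a minimal sketch of the retrieval and prompt-construction steps, assuming an off-the-shelf sentence-transformers encoder; the checkpoint name, prompt template, and helper functions are illustrative choices rather than the exact configuration used in the experiments.

```python
from sentence_transformers import SentenceTransformer, util

# Frozen, off-the-shelf SBERT-style encoder (checkpoint name is an assumption).
encoder = SentenceTransformer("all-mpnet-base-v2")

def retrieve_examples(query, pool, k=20):
    """Return the k (text, label) pairs from the few-shot pool most similar to the query."""
    pool_texts = [text for text, _ in pool]
    pool_emb = encoder.encode(pool_texts, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]
    top = scores.topk(min(k, len(pool))).indices.tolist()
    return [pool[i] for i in top]  # most-to-least similar; ordering is studied in the Results section

def build_prompt(query, demonstrations):
    """Linearize the retrieved demonstrations and append the query to be classified."""
    blocks = [f"sentence: {text}\nlabel: {label}" for text, label in demonstrations]
    blocks.append(f"sentence: {query}\nlabel:")
    return "\n\n".join(blocks)
```

The resulting prompt is passed to the LLM, which generates a label prediction as free text (see "Restricting model output" below).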
We evaluate LLMs in this setting with three intent classification datasets: BANKING77 (Casanueva et al., 2020), HWU64 (Liu et al., 2019), and CLINC150 (Larson et al., 2019), as well as one fine-grained sentiment classification dataset: GoEmotions (Demszky et al., 2020). Experiments are done using the LLaMA models (Touvron et al., 2023) and the OPT models (Zhang et al., 2022) as LLMs. We compare the performance achieved against adapter-based fine-tuning of MLM models (DeBERTa-v2-XXLarge with the "Pfeiffer" bottleneck-style adapter (Pfeiffer et al., 2020b) implemented with AdapterHub (Pfeiffer et al., 2020a)) and the previous SoTA for intent detection (ConvFit; Vulić et al. 2021).

Splits: For the intent detection experiments, to allow for direct comparison with previous works, we use the same 5-shot and 10-shot sets as DialoGLUE (Mehri et al., 2020). Experiments are run 3 times and the accuracies are averaged, except for the zero-training LLM setups, which are deterministic. For the GoEmotions experiments we average the results across 3 different random 10-shot and 5-shot splits, as no pre-existing few-shot splits exist. The GoEmotions experiments use the subset of the GoEmotions data (84% of the training set, 85% of the test set) where there is only one emotion label, to avoid issues of enforcing an ordering on a linearized version of multiple labels in sequence, as well as to mimic the single-label intent detection setup more closely. Default library parameters were used.
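For reference, the single-label filtering described above can be reproduced with the Hugging Face datasets library roughly as follows; the exact preprocessing used for the experiments may differ in detail.

```python
from datasets import load_dataset

# Keep only examples annotated with exactly one emotion label
# (roughly 84% of the original training set and 85% of the test set).
goemotions = load_dataset("go_emotions", "simplified")
single_label = goemotions.filter(lambda example: len(example["labels"]) == 1)

train_pool = single_label["train"]
test_set = single_label["test"]
```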
Computing Hardware and model differences: All experiments were performed on a single A100 80GB GPU, except those with OPT 175B, which were performed with 8 A100 GPUs. For LLaMA 65B and 70B, 8-bit quantization was used. The main difference between the OPT and LLaMA models is the amount of pre-training data used.
The LLaMA models were trained on 1T-1.4T tokens, while the OPT models were only trained on 180B tokens (see Zhang et al. (2022) and Touvron et al. (2023) for more details). LLaMA-2 models were trained on 2T tokens.

Restricting model output:
To reduce computational load and make inference easier, instead of using the logits of the LLM to rank our many classes (which would require multiple forward passes, as class names consist of multiple tokens), we let the LLM generate freely. Having generated an output text, we then use the retrieval model (SBERT) to retrieve the most similar class label from our set of classes. This allows us to restrict the model output to the set of classes we want without incurring additional inference cost. Instances of generated predictions that do not match our class list are few regardless, and become rarer as more examples are provided in-context.
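As a concrete illustration, this output-restriction step can be implemented with the same kind of encoder used for retrieval; the following is a sketch under that assumption, with an illustrative checkpoint and helper name.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed SBERT checkpoint

def snap_to_label(generated_text, class_names):
    """Map a free-form LLM generation to the most similar class name by cosine similarity."""
    label_emb = encoder.encode(class_names, convert_to_tensor=True, normalize_embeddings=True)
    gen_emb = encoder.encode(generated_text, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(gen_emb, label_emb)[0]
    return class_names[int(scores.argmax())]
```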
Baselines: Several baselines are provided. The baseline "Pre-trained SBERT 1-NN" refers to using the SBERT retrieval model to retrieve the most similar example in the retrieval pool and use its label directly as the prediction (1-nearest-neighbor). The ConvFit baseline is taken from the reported numbers in the ConvFit paper directly. The baseline "DeBERTa (Pfeiffer)" is the DeBERTa-XXL model released by Microsoft, trained via AdapterHub with the Pfeiffer-style bottleneck adapters (Pfeiffer et al., 2020b,a). Preliminary results with other adapter types (LoRA, IA³, etc.) showed that the Pfeiffer-style adapters were the most effective in this particular use-case. The DeBERTa-XXL model was fine-tuned until performance saturation (early stopping).
We also report results for SetFit (Tunstall et al., 2022), a method involving contrastive fine-tuning of a retriever model with a classification head, as it is a competitive and lightweight baseline in this setup. Baselines were selected based on recent strong progress in few-shot classification using parameter-efficient fine-tuning, which in certain cases has been shown to outperform full fine-tuning (Liu et al., 2022a).

Results
Example ordering: We provide a brief study of how to order examples in the prompt by similarity, since previous work has been inconclusive on this front, suggesting that the ideal ordering is dataset-dependent (Liu et al., 2022b). As seen from Table 3, least-to-most (LTM) similar was the most effective ordering across all datasets. Larger models are significantly less sensitive to ordering.
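For clarity, the two orderings compared can be expressed as a small helper, assuming the retriever returns a similarity score with each demonstration (the function name is illustrative).

```python
def order_demonstrations(scored_demos, strategy="ltm"):
    """Order demonstrations for the prompt.

    scored_demos: list of (similarity, text, label) tuples from the retriever.
    "ltm" places the least similar example first and the most similar last
    (closest to the query); "mtl" is the reverse.
    """
    ranked = sorted(scored_demos, key=lambda d: d[0])  # ascending similarity
    return ranked if strategy == "ltm" else list(reversed(ranked))
```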
SoTA performance: Tables 1 and 2 show the performance comparison of all methods. The performance of the retrieval+ICL pipeline on BANKING, HWU, and CLINC is state of the art in both the 5-shot and 10-shot settings. Moreover, significantly surpassing the previous state of the art on all three intent classification datasets requires only LLaMA-2 7B, which with 8-bit quantization can be run on consumer hardware. In the most challenging evaluation setting (the highly specialized intent classes of the BANKING dataset in the most data-scarce 5-shot setting), the margin between DeBERTa and LLaMA-2 70B is 7.49%. In general the DeBERTa model showed lower performance in the 5-shot scenarios, likely due to the extremely limited data. In the case of GoEmotions (Table 2), when using the neutral category, the Retrieval+ICL pipeline clearly outperforms the strongest baseline (SetFit) only in the 5-shot case. In the 10-shot case, Retrieval+ICL performs at least on par with, and likely better than, SetFit.
Table 4 illustrates the difficulty of the GoEmotions task, specifically how fine-grained the classes are.
Performance degradation: We also provide a study of how performance changes with the number of examples provided in-context. Figure 2 shows this variation for the HWU64 dataset. The x-axis value of 110 corresponds to a fully saturated context window, which holds this many examples on average. In the case of LLaMA-7B, performance somewhat degrades after a certain number of demonstrations. Looking at Tables 1 and 2, comparing LLaMA-2-7B and LLaMA-2-70B in the regular and 4K context window scenarios, we see very clearly that only the 70B model is able to continually improve with the full 4K context. The 7B model instead sees matching (no improvement) or degraded performance in most cases.
Impact of "Neutral" on GoEmotions: From the results in Table 2, by comparing the results with and without the "neutral" category, we see that the difference between the baselines and Retrieval+ICL grows, implying that "neutral" disproportionately hurts the Retrieval+ICL performance.We note that correctly predicting the neural class is challenging for the LM.We demonstrate that removing "neutral" from the retrieval pool does not harm performance ("Retrieval without Neutral" in Table 2).Analyzing the results for one of the runs, we see that out of the 1605 examples of the "neutral" class in the test set, "neutral" only appears in the top 3 classes retrieved by the retriever (by number of examples) only 9% of the time (in the top 5 classes 18%).This suggests that the retriever may be limiting the performance.

Ablation Studies
Several ablation studies are conducted to test which aspects of the retrieved examples the LLM uses to make its predictions. The ablation studies were done on a random split of the HWU dataset and the GoEmotions dataset. Ablation results for HWU are shown visually in Figure 3 and for GoEmotions in Figure 4.
1. Obfuscated labels: We change all the class names to randomly assigned enumerated names ("Class 1", "Class 2", etc.). The intent is to disentangle the model's use of prior (pre-training) knowledge to perform the task (based on the semantic content of the label names) from its use of the input-output pairs provided in the prompt.

2. Resampled in-context examples: We test whether similarity between the demonstrations provided in the prompt and the current input example is actually necessary for effective performance. By resampling from the classes initially retrieved by the retriever model, we preserve the distribution of labels but change the input demonstrations themselves, so that they are no longer the nearest neighbors in the embedding space for each class.

3. Shuffled labels: We shuffle the correspondence between the in-context demonstrations and their labels, so that inputs are paired with incorrect labels, testing whether the model relies on the correct input-label mapping. A combined sketch of all three perturbations is given below.
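As a rough illustration of how these perturbations can be implemented, the sketch below assumes demonstrations are (text, label) pairs and that the few-shot pool is grouped by class; all function names are illustrative rather than the exact code used for the experiments.

```python
import random
from collections import Counter

rng = random.Random(0)

def obfuscate_labels(class_names):
    """Ablation 1: replace class names with semantically empty enumerated aliases."""
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    return {name: f"Class {i + 1}" for i, name in enumerate(shuffled)}

def resample_demonstrations(retrieved, pool_by_class):
    """Ablation 2: keep the retrieved label distribution, but draw random examples
    of each class instead of the nearest neighbors."""
    counts = Counter(label for _, label in retrieved)
    resampled = []
    for label, n in counts.items():
        resampled.extend(rng.sample(pool_by_class[label], n))
    return resampled

def shuffle_label_correspondence(demonstrations):
    """Ablation 3: break the input-label correspondence by permuting the labels."""
    texts = [text for text, _ in demonstrations]
    labels = [label for _, label in demonstrations]
    rng.shuffle(labels)
    return list(zip(texts, labels))
```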

Discussion

Small models cannot use long contexts as effectively as large models
One trend noticeable from the performance graph as a function of the number of examples for HWU (see Figure 2) is that small models seem to be unable to use more examples as effectively as large models. The smaller OPT model is unable to effectively make use of the entire context window when it is filled and remains at relatively low performance. In contrast, OPT 175B shows continual improvement as more examples are added. A similar trend is visible for the LLaMA models, where the performance of the 7B model does not change significantly (see Figure 2), but the 65B model is able to continuously improve. The smaller models either level off (OPT-13B) or lose performance (LLaMA-7B). In the 4K full context window settings for LLaMA-2, the difference between model scales is even more apparent (Tables 1 and 2). We see the small model making inconsistent use of the longer contexts: sometimes improving, but mostly staying the same or worsening. Meanwhile, the large model consistently improves with the full context in almost all cases.

Similarity to current datapoint matters for intent classification
In the resampling ablation for HWU (see Figure 3) we see that resampling from the initial class distribution provided by the retriever model damages the performance of both OPT 175B and LLaMA 7B. This supports the strong performance numbers of the LLMs, showing that the similarity between in-context demonstrations and the current input matters. This implies that the LM is doing more than just selecting the most common class or just using the shortlist of class labels from the full set of classes to select in a more zero-shot fashion. One interesting difference to note is that OPT 175B, the larger model, shows a larger drop from the resampling as the number of in-context demonstrations increases, compared to LLaMA-7B, whose performance stays roughly constant (but lower than non-resampled). This may indicate that the LLaMA models, with their additional training data, are more robust to the resampling process, due to stronger pre-training knowledge and/or more robust performance overall. In the case of GoEmotions, we see almost no variation with resampling, showing that similarity to the input example is less influential, though the ordering of the examples relative to each other does seem to make a difference for the 7B model (Table 3).

Semantically significant label names matter greatly for sentiment classification
In the obfuscation ablation (see Figure 3), we see that all models are hurt by obfuscating label names. We see, however, that models are still able to learn to perform the task effectively, and in fact show similar improvement curves with increasing numbers of examples, just with a lower starting performance. This demonstrates that the semantic content of the labels is significantly useful to the models, but at the same time it is not integral to performing the task, which can also be done without semantically significant labels. In the case of GoEmotions, the obfuscated labels hurt the model particularly badly, bringing performance down significantly. It seems that the class names are integral to performance, but at the same time more examples are still helpful to the model, as with the 4K context window it still sees improved performance.

Input-label correspondence matters for all datasets
Shuffling the input-label correspondence is the ablation in which we see the performance of all the models decrease the most in the intent detection case (see Figure 3). Specifically, we see that the performance drop is proportional to the number of examples (more shuffled examples brings a larger drop). That being said, it is noteworthy that the performance of both models in this shuffled regime is still significantly above random chance for every number of demonstrations shown, implying perhaps that the LM's prior knowledge based on the label names is still contributing significantly to performance. In all 4 datasets (intent classification and GoEmotions), shuffling the labels hurts the larger model in particular. This aligns with the results of Wei et al. (2023), who show that larger models are better able to learn perturbed input-label correspondences than smaller models, which manifests in this experiment as lower performance.
In other words, the larger model is trying to learn the perturbed input-label correspondence, and thus loses more and more performance as more examples are added, while the smaller model more effectively ignores the perturbation.

Retriever and LM Generalization
One interesting result from our experiments is that generic retrievers seem able to generalize quite effectively across domains and tasks.
Using the exact same retriever model across 3 different intent detection datasets (which according to the taxonomy of Hupkes et al. (2022) constitutes cross-task generalization) as well as a sentiment classification dataset (according to the same taxonomy, cross-domain generalization) yields SoTA or better performance in almost all cases. The distribution shift locus, for both the retriever and the language model generating the final prediction, is from pre-training to test time, as both are pre-trained on massive generic data before being tested in a zero-shot setting.

Related Work
Nearest neighbor selection of in-context examples: One of the earliest studies of the role of example selection in ICL is "KATE" (Liu et al., 2022b). In this paper, the authors probe the performance of GPT-3 on NLP tasks using KNN retrieval (RoBERTa) for example selection. They compare this method against random selection and using the retrieval model directly (plain KNN). They also examine the effect of example ordering on performance and conclude that the most performant ordering (least-to-most and most-to-least similar orderings are tested) depends on the dataset. In our work, we also experiment with example ordering, and conclude that least-to-most ordering is the most effective across all datasets tested.
Works demonstrating order instability: Several recent works have demonstrated that the order of in-context examples makes a large difference in performance, including Lu et al. (2022) and Zhao et al. (2021). These works demonstrate such order instability that certain permutations bring near-SoTA performance on tasks while others perform close to random guessing.
Long contexts: Other works study generalizing to long context lengths, as well as providing explanations for LMs' sensitivity to ordering (positional embeddings). In Liu et al. (2023), the authors investigate the impact of long contexts on document question answering, finding that the positions of the answers within the context matter greatly for performance, and generally demonstrating that longer contexts cause lower performance.
In this work we show that larger models are needed to effectively take advantage of long contexts for ICL.
Few-shot intent detection: The current state of the art in few-shot intent detection is the ConvFit method (Vulić et al., 2021). ConvFit uses a pre-trained LM (e.g. BERT or RoBERTa) in a dual-encoder configuration with two training stages. The first stage is conversational fine-tuning on a generic conversational corpus with a retrieval task (given (context, response) tuples, retrieve the correct response for each context). The second stage is fine-tuning on the specific intent classification dataset with a contrastive loss, allowing the resulting LM to be used in a KNN fashion.

Conclusion
In this work, we show that ICL with off-the-shelf frozen pre-trained retriever models can provide strong performance for text classification tasks with many labels. We show state-of-the-art performance across three different intent classification datasets, and competitive performance on fine-grained sentiment classification. We also show that larger models are necessary to make use of more in-context examples, whereas small models mostly plateau or even show decreasing performance after a point. Through several ablation experiments, we demonstrate that LMs make use of all aspects of the input examples: semantically significant label names, correct input-label correspondences, and the similarity between the in-context demonstrations and the current input, though to varying degrees depending on the dataset and domain.
Acknowledgements
SR is supported by the Canada CIFAR AI Chairs program and the NSERC Discovery Grant program. AM is supported by an IVADO Excellence Scholarship.
Limitations
One limitation of the research in this paper is that the experiments use the pre-existing DialoGLUE few-shot splits for each dataset, following the example of prior works and to remain comparable to them (with the exception of the ablation study, which uses a separate split). However, since experiments were done only on this split, it is not necessarily the case that the results and model rankings transfer to other splits (although it is worth noting from Figure 3 that performance on the random ablation split is very similar to the DialoGLUE split, and the model ranking remains the same). This limitation does not apply to GoEmotions, whose results are given as averages across three random splits. Another limitation is the relatively small number of runs/seeds (only 3) due to limitations on compute. One further limitation is that the experiments are all performed on English-language data.

Figure 2: HWU performance as a function of the number of examples in the prompt. The x-axis scale is non-linear, as there are diminishing returns with more examples. "Sat" (saturated) indicates filling the prompt greedily until the max length is reached.
The contributions of this work are:

1. We show that retrieval-augmented ICL is an effective way to tackle text classification tasks with many labels without additional tuning of either the retriever or the LM, either matching or outperforming fine-tuned adapter-based and contrastive-pre-training-based methods. Notably, truncating the dataset by showing only a subset to the LM at a time does not prevent us from achieving SoTA performance, and allows us to apply LLMs to problems that they have not been applied to before.

2. We analyze ICL performance over different numbers of examples and demonstrate that larger models are better able to take advantage of more examples in-context than smaller models, which mostly plateau and/or see decreasing performance.

3. Through several ablations, we analyze the LM's use of different aspects of the in-context examples (semantically significant label names, correct input-label correspondences, and semantically similar demonstrations to the current input). Contrary to certain recent works, our experiments demonstrate that they are all used to varying degrees, depending on the dataset and domain.

Table 1: Intent classification accuracy for retrieval+ICL and baseline methods. All retrieval+ICL results are with 20 in-prompt examples unless otherwise specified. The retrieval/training dataset size is given in the second row of the header (10-shot is 10 examples per class, 5-shot is 5).

Table 2: Sentiment classification macro F1 score (following prior work) over 3 random splits for retrieval+ICL and baseline methods. All retrieval+ICL results are from saturating the prompt with in-prompt examples (with a 2K prompt length unless otherwise specified). The retrieval/training dataset size is given in the second row of the header (10-shot is 10 examples per class, 5-shot is 5). +Neut refers to the case where the "neutral" class (lack of emotion) is included in the dataset.

Table 4: Sample datapoints from GoEmotions.

Fine-tuned retrievers and LMs: Several works employ fine-tuned retrievers, re-rankers, and/or LMs, including Rubin et al. (2022); Ram et al. (2023); Shi et al. (2023). Some, like REPLUG (Shi et al., 2023), use LM feedback, scoring documents with the LM to train the retriever. The goal of both Ram et al. (2023) and Shi et al. (2023) is to improve language modeling rather than ICL ability. Rubin et al. (2022) uses similar LM-score-based feedback to train a retriever (like REPLUG), but for ICL. The difference between all of these works and ours is that we demonstrate that an off-the-shelf retriever is sufficient out-of-the-box for SoTA performance with no additional tuning.