Text Classification via Large Language Models

Despite the remarkable success of large-scale Language Models (LLMs) such as GPT-3, they still significantly underperform fine-tuned models on text classification. This is due to (1) the lack of reasoning ability needed to address complex linguistic phenomena (e.g., intensification, contrast, irony, etc.); (2) the limited number of tokens allowed in in-context learning. In this paper, we introduce Clue And Reasoning Prompting (CARP). CARP adopts a progressive reasoning strategy tailored to the complex linguistic phenomena involved in text classification: CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc.), based on which a diagnostic reasoning process is induced for the final decision. To further address the limited-token issue, CARP uses a model fine-tuned on the supervised dataset for $k$NN demonstration search in in-context learning, allowing the model to take advantage of both the LLM's generalization ability and the task-specific evidence provided by the full labeled dataset. Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 vs. 93.3). More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves performance comparable to supervised models trained with 1,024 examples per class.

In spite of this success, LLMs with ICL still significantly underperform fine-tuned models for text classification. This is due to two reasons: (1) Text classification requires models with more powerful reasoning abilities to resolve complex linguistic phenomena including clause composition (e.g., concession, negation, intensification), irony, etc. Recent efforts to improve LLMs' reasoning capabilities (Wei et al., 2022; Kojima et al., 2022; Ye and Durrett, 2022; Zhang et al., 2022b) mainly focus on tackling math problems, and thus are not tailored to the reasoning process necessary for the multitude of intricate linguistic phenomena in text classification; (2) The number of demonstration examples allowed in in-context learning is limited, e.g., the longest context allowed for GPT-3 is 4,096 subtokens. Therefore, LLMs are only able to take advantage of a small proportion of the training set, performing well below supervised baselines. In this paper, we introduce Clue And Reasoning Prompting (CARP), an extensible, annotation-free and efficient framework for text classification via large language models. To address the reasoning process necessary for handling the linguistic phenomena in text classification, CARP decomposes the reasoning process into three steps: LLMs are first prompted to find superficial clues (e.g., keywords, tones, semantic relations, etc.) in the given text; next, CARP treats the clues and the input as premises and induces a diagnostic reasoning process; and finally, the model determines the final label based on the above two steps. We find this progressive reasoning strategy to be effective in enhancing LLMs' ability in the language reasoning involved in text classification. Due to the limited number of tokens allowed in the context, a more effective demonstration search is needed. CARP uses a model fine-tuned (FT) on the supervised dataset for kNN demonstration search for ICL. Since the fine-tuned model is trained on task-specific labels, it guarantees that retrieved samples are close to the input sequence with respect to the task. FT-based demonstration search provides a channel to connect LLMs with the full training set, in spite of the limited number of tokens allowed in demonstrations. This strategy lets the model take advantage of both the LLM's generalization abilities and all the task-specific evidence provided by the training dataset.

Large Language Models
Large language models (LLMs) are models trained with self-supervised objectives on large unlabeled corpora. LLMs can be broadly divided into three categories based on model architecture. The first category is encoder-only models like BERT (Devlin et al., 2018). BERT (300M) (Devlin et al., 2018) and its variants (Liu et al., 2019; Sun et al., 2020; Clark et al., 2020; Feng et al., 2020; Sun et al., 2021) adopt the pre-training then fine-tuning paradigm for NLP tasks: they use masked language modeling as the main pre-training objective and fine-tune the pretrained model on annotated downstream datasets. The second category is decoder-only models like GPT (Radford et al., 2019a). GPT (Radford et al., 2019a) uses the decoder of an auto-regressive Transformer (Vaswani et al., 2017) to predict the next token in a sequence. GPT (Radford et al., 2019a) and its variants (Dai et al., 2019; Keskar et al., 2019; Radford et al., 2019b; Chowdhery et al., 2022; Zhang et al., 2022a) also follow the pre-training then fine-tuning paradigm. GPT-3 (175B) (Brown et al., 2020) proposes to formalize all NLP tasks as generating textual responses conditioned on the given prompt. The third category is encoder-decoder models like T5 (Raffel et al., 2020), including T5 (11B) (Raffel et al., 2020) and its variants (Lewis et al., 2019; Xue et al., 2020).

In-context Learning
In-context learning (ICL) generates textual responses (i.e., label words) conditioned on a given prompt, usually with a few annotated examples, for downstream tasks. Li and Liang (2021), Zhong et al. (2021), and Qin and Eisner (2021) propose to optimize prompts in continuous space. Rubin et al. (2021) propose to retrieve demonstrations for in-context learning; subsequent work (2022) shows that explanations of examples in a few-shot prompt lead to a performance boost. Marasović et al. (2021) find that GPT-3 outperforms other models by a large margin on the explanation generation task. Wei et al. (2022) propose chain-of-thought reasoning and use <input, chain-of-thought, output> triples as the prompt for LLMs. Wiegreffe et al. (2021) train a supervised filter to select explanations generated by GPT-3 on the SNLI and CommonsenseQA tasks.

Overview
We follow the standard prompt-based in-context learning paradigm. Given an input sequence $x_{\text{input}} = \{x_1, x_2, ..., x_l\}$, the task of assigning a text-class label to an input text is transformed into generating a pre-defined textual response $y \in \mathcal{Y}_{\text{verb}}$ (e.g., positive, negative, etc.) conditioned on the prompt $x_{\text{prompt}}$ using a language model.

Prompt Construction
The prompt $x_{\text{prompt}}$, which is constructed based on the input, consists of the following three components: (1) Task description $x_{\text{desc}}$, which generally describes the task. For different classification tasks, e.g., sentiment classification, topic classification, etc., the descriptions differ. Taking sentiment classification as an example, the task description is given as follows: Classify the overall sentiment of the input as positive or negative. (2) Demonstrations, which consist of a sequence of annotated examples $\{(x^j_{\text{demo}}, y^j_{\text{demo}})\}_{j=1}^{k}$, where $x^j_{\text{demo}}$, $1 \le j \le k$, denotes the $j$-th input sequence and $y^j_{\text{demo}}$ denotes the text transformed from its label, e.g., positive or negative for the binary sentiment classification task. Demonstrations serve two purposes: (a) providing the LLM with evidence to consult for decision making, which significantly boosts performance; (b) providing an output format that the LLM's outputs need to follow, so that the output, which takes the form of natural language, can easily be transformed back into labels. It is worth noting that demonstrations are only needed for the few-shot setup, not for the zero-shot setup.
(3) Input $x_{\text{input}}$ is the test text sequence to classify. The prompt $x_{\text{prompt}}$ for a test input is constructed by concatenating the task description $x_{\text{desc}}$, a sequence of demonstrations $\{(x^1_{\text{demo}}, y^1_{\text{demo}}), ..., (x^k_{\text{demo}}, y^k_{\text{demo}})\}$, and the test sequence $x_{\text{test}}$, which can be written as follows: $\{x_{\text{desc}}; \backslash n; \langle\text{demo}\rangle_1; \backslash n; ...; \langle\text{demo}\rangle_k; \backslash n; x_{\text{test}}\}$
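To make the prompt layout concrete, the following is a minimal sketch of how a prompt could be assembled from the three components. The function and field names (e.g., `build_prompt`, the INPUT/SENTIMENT template) are illustrative assumptions, not the exact original implementation.

```python
def build_prompt(task_desc, demonstrations, test_text):
    """Assemble an ICL prompt: task description, k demonstrations, test input.

    demonstrations: list of (demo_text, demo_label_word) tuples;
    the list may be empty in the zero-shot setup.
    """
    parts = [task_desc]
    for demo_text, demo_label in demonstrations:
        parts.append(f"INPUT: {demo_text}\nSENTIMENT: {demo_label}")
    # The test sequence comes last; the LLM completes the label word.
    parts.append(f"INPUT: {test_text}\nSENTIMENT:")
    return "\n".join(parts)

prompt = build_prompt(
    "Classify the overall sentiment of the input as positive or negative.",
    [("A gorgeous, witty film.", "positive"),
     ("Tedious and overlong.", "negative")],
    "So clever you want to hate it.",
)
```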

Demonstration Sampling
The few-shot setup requires demonstrations sampled from the training set. Strategies that we explore include the following. Random Sampling: a straightforward strategy is to randomly sample $k$ examples from the training set $D_{\text{train}}$ for a test sequence $x_{\text{test}}$.
kNN Sampling: the key disadvantage of random sampling is that there is no guarantee that the selected samples are semantically related to the input sequence. A straightforward alternative is to sample examples that are similar to the test sequence using kNN search (Khandelwal et al., 2020). In this process, the test sequence $x_{\text{test}}$ is first mapped to a vector $v_{\text{test}}$ using an encoder model $f$. Then, using $v_{\text{test}}$ as the query, we search through the entire training set $D_{\text{train}}$ to retrieve the $k$ nearest text sequences, obtaining $k$ nearest data examples $N = \{x^j, y^j\}_{j=1}^{k}$ as demonstrations. We use the following encoder models to obtain sentence representations and similarity scores: SimCSE, a contrastive learning model for sentence embeddings (Gao et al., 2021), and a fine-tuned model (FT for short), described below.
The key disadvantage of SimCSE (Gao et al., 2021) and other general semantic encoding models (Reimers and Gurevych, 2019; Seonwoo et al., 2022; Sun et al., 2022) is that they measure general semantic similarity but are not specifically tailored to the text classification task. To resolve this issue, CARP uses a model fine-tuned on the training dataset as the kNN encoder model. Specifically, we first fine-tune a RoBERTa model on the training data. Next, we use its [CLS] embedding as the sentence-level representation for kNN search. Since the fine-tuned model is trained on task-specific labels, it guarantees that retrieved samples are close to the input sequence with respect to the task. Using the fine-tuned model provides a channel to connect LLMs with the full training set, in spite of the limited number of tokens allowed in demonstrations. This strategy lets the model take advantage of both the LLM's generalization abilities and all the task-specific evidence provided by the training dataset.
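A minimal sketch of FT-based kNN demonstration search, assuming a RoBERTa checkpoint already fine-tuned on the target task; the checkpoint path, cosine-similarity retrieval, and brute-force search are illustrative choices rather than the exact original implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed: a RoBERTa checkpoint fine-tuned on the target classification task.
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-roberta")
encoder = AutoModel.from_pretrained("path/to/finetuned-roberta").eval()

@torch.no_grad()
def embed(texts):
    """Return the [CLS] (<s>) embedding of each text as its sentence vector."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, dim)
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1)

def knn_demonstrations(test_text, train_texts, train_labels, k=16):
    """Retrieve the k training examples closest to the test text."""
    train_vecs = embed(train_texts)                  # can be pre-computed offline
    sims = embed([test_text]) @ train_vecs.T         # cosine similarity (vectors normalized)
    top = sims.squeeze(0).topk(k).indices.tolist()
    return [(train_texts[i], train_labels[i], sims[0, i].item()) for i in top]
```

In practice the training-set embeddings would be computed once and indexed (e.g., with an approximate nearest-neighbor library) rather than re-encoded per query.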

Clues Collecting and Reasoning
To enhance the model's reasoning ability for the linguistic phenomena involved in text classification, we propose a progressive reasoning strategy that involves clue collection, reasoning and decision making. This process also mimics how humans make decisions: we first collect evidence from the input, separating the wheat from the chaff; next we piece together local evidence to form a global picture, which leads to the final decision. Below, we first give an overview of the clue collecting and reasoning process, and then describe implementation details.

Overview
Collecting Clues: for a test sequence, clues are local factual evidence such as keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references, etc. The following is an example of clues for an input: Input: Steers turns in a snappy screenplay that curls at the edges; so clever you want to hate it. Clues: "snappy", "clever", and "want to hate it" are clues for determining the sentiment of the input sentence.
Reasoning: for reasoning, the LLM is prompted to go beyond superficial keywords to mine deeper perspectives, considering linguistic phenomena such as negation, intensification, irony, etc., and to piece together local evidence to form the final decision. The following example shows the reasoning process for deciding the sentiment of the above example based on the evidence collected: 1. The phrase "snappy screenplay" implies that the screenplay is of high quality and well-crafted. 2. The phrase "curls at the edges" implies that the screenplay is cleverly written. 3. ...

Decision Making
Based on the reasoning process, the model makes the decision for the sentiment of the given input: Overall, the clues and reasoning process point to a positive sentiment for the input sentence.
The merits of incorporating clue finding and reasoning are as follows: (1) it prompts the model to think and make decisions progressively: clue finding focuses more on superficial features such as keywords, while reasoning makes deeper justifications based on these superficial features; (2) clue finding and reasoning serve as a tunnel for human intervention: in the few-shot setup, where clues and reasons need to be prepared in advance for demonstrations, we can modify them as we see fit, which is extremely helpful for troubleshooting and error correction in the prompt-construction stage; (3) from an interpretation and uncertainty-estimation perspective, clues and reasoning in the few-shot setup act as human-readable influence functions.

Zero-shot Scenario
In the zero-shot setup, as no demonstrations are allowed, no concrete example of clues and reasons can be provided in the prompt. Instead, we only add requests asking the model to output clues and reasons in the prompt.

Few-shot Scenario
In the few-shot setup, we need to prepare clues and reasoning for all examples in the training set in advance, since every training example has a chance of being selected as a demonstration for some test input. Previous efforts on math problems (Wei et al., 2022; Kojima et al., 2022; Ye and Durrett, 2022; Zhang et al., 2022b) prepare hand-crafted reasoning for a few examples and always use these examples as demonstrations. This strategy does not fit our situation, as it is extremely time-intensive to manually generate clues and reasoning for all training examples. To resolve this issue, we harness LLMs for automatic clue and reasoning generation, asking LLMs to generate clues and reasoning based on both the input and its corresponding label.
Clue Generation: for a given training example <text> paired with the label word <label-word> (e.g., positive), we ask the LLM to generate clues that indicate the label: List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references) that support the sentiment determination of the input (limit to 15 words).

INPUT: <text> SENTIMENT: <label-word>
Reasoning Generation: based on the generated clues, the input, and the label, we ask the LLM to generate reasoning details: Based on the input and clues, articulate the diagnostic reasoning process that supports the sentiment determination of the input. INPUT: <text> LABEL: <label-word> CLUES: <clues> REASONING: Given the generated clues and reasoning for all training examples, at test time, when the k nearest examples are selected as demonstrations, their corresponding clues and reasons are concatenated to the demonstrations.
In this way, each demonstration example is composed of a (text, clues, reasons, gold label word) tuple. Examples of prompts with clues and reasons are shown in Figure 4. For a test example, by following the format of the demonstrations, the LLM first outputs clues, then reasons, and finally the decision.
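The following sketch illustrates how the offline clue/reasoning generation and the demonstration format could fit together. `query_llm` is a placeholder for whatever completion API is used, and the templates paraphrase the prompts above; names are illustrative, not the original code.

```python
CLUE_PROMPT = (
    "List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, "
    "semantic relationships, tones, references) that support the sentiment "
    "determination of the input (limit to 15 words).\n"
    "INPUT: {text}\nSENTIMENT: {label}\nCLUES:"
)
REASONING_PROMPT = (
    "Based on the input and clues, articulate the diagnostic reasoning process that "
    "supports the sentiment determination of the input.\n"
    "INPUT: {text}\nSENTIMENT: {label}\nCLUES: {clues}\nREASONING:"
)

def annotate_training_example(text, label, query_llm):
    """Generate clues and reasoning for one (text, label) training pair offline."""
    clues = query_llm(CLUE_PROMPT.format(text=text, label=label))
    reasoning = query_llm(REASONING_PROMPT.format(text=text, label=label, clues=clues))
    return {"text": text, "clues": clues, "reasoning": reasoning, "label": label}

def format_demonstration(example):
    """Render one annotated example as a (text, clues, reasons, label) demonstration."""
    return (f"INPUT: {example['text']}\nCLUES: {example['clues']}\n"
            f"REASONING: {example['reasoning']}\nSENTIMENT: {example['label']}")
```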

Voting
Unlike conventional discriminative models for text classification, which produce deterministic results at inference, LLMs used for in-context learning are generative models and produce distinct textual responses under diverse sampling strategies across multiple runs. We consider the following voting strategies in this paper: • Majority Vote: the final result is the most frequent prediction among multiple runs. • Weighted Probability Vote: the final result is the label with the largest summed probability across multiple runs.
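A minimal sketch of the two voting strategies over multiple sampled runs; here each run is assumed to yield a predicted label and an associated probability (e.g., derived from token log-probabilities), which is an assumption about how the per-run scores are obtained.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """predictions: list of label strings, one per run."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_probability_vote(predictions, probabilities):
    """Sum each label's probabilities across runs and pick the largest total."""
    scores = defaultdict(float)
    for label, prob in zip(predictions, probabilities):
        scores[label] += prob
    return max(scores, key=scores.get)

# Example: five sampled runs over the same prompt.
runs = ["positive", "positive", "negative", "positive", "negative"]
probs = [0.92, 0.85, 0.55, 0.88, 0.51]
print(majority_vote(runs))                     # positive
print(weighted_probability_vote(runs, probs))  # positive
```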

Experiments
We conduct experiments on five widely-used datasets, including SST-2 (Socher et al., 2013), R8, R52, AGNews (Zhang et al., 2015) and Movie Review (MR) (Pang and Lee, 2005). More details of the benchmarks and low-resource datasets can be found in Appendix D.
For zero-shot and few-shot experiments, we use InstructGPT-3 (Ouyang et al., 2022) (text-davinci-003, 175B) as the backbone. Due to the input token limitation, we use k = 16 for few-shot setups. Prompts for the five datasets are shown in Appendix H. Model hyper-parameters can be found in Table 13. We use Vanilla to denote the conventional ICL approach where LLMs are prompted to directly generate labels, CoT (Kojima et al., 2022) to denote the baseline that mimics the chain-of-thought strategy, and CARP to denote the proposed method.
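For reference, a hedged sketch of how a prompt could be sent to text-davinci-003 with the legacy OpenAI completions client; the client version and hyper-parameter values shown here are illustrative placeholders, not the exact settings reported in Table 13.

```python
import openai  # legacy 0.x-style client

openai.api_key = "YOUR_API_KEY"  # placeholder

def query_gpt3(prompt, n_runs=5):
    """Sample several completions for the same prompt (used later for voting)."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,        # illustrative value
        frequency_penalty=0.0,  # illustrative value
        n=n_runs,
    )
    return [choice["text"].strip() for choice in response["choices"]]
```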

Models for Comparison
Supervised models trained on the training set naturally constitute baselines to compare with. We use six models: RoBERTa-Large, RoBERTa-GCN, DeBERTa, XLNet, GCN-SB, and VLAWE. Details of the models and hyper-parameters are shown in Appendix G. Few-shot Setup: for demonstration sampling strategies in the few-shot setup, we consider the strategies described above for comparison (more details can be found in Section 3).

Results on the full training set
Experimental results are shown in Table 1. As can be seen, few-shot setups consistently outperform zero-shot setups. In terms of sampling strategies in the few-shot setups, we observe that the SimCSE kNN sampler outperforms the random sampler, illustrating the importance of adding demonstrations that are relevant to the test input in the few-shot setup. We also observe that the FT kNN sampler consistently outperforms the SimCSE kNN sampler. This shows that the fine-tuned model, which takes advantage of the full training set, serves as a better retriever for task-specific demonstration retrieval than the general-purpose SimCSE retriever.
For the different reasoning strategies, we first observe that the CoT strategy outperforms the vanilla strategy, which straightforwardly asks LLMs to generate results without further reasoning steps. CARP consistently outperforms CoT across all benchmarks, i.e., +1.48, +0.97, +2.76, +3.29 and +0.47 respectively on the SST-2, AGNews, R8, R52 and MR datasets. This demonstrates the necessity of modeling the complex linguistic phenomena involved in text classification, and the effectiveness of CARP in doing so.
Compared with supervised learning baselines, we find that the vanilla LLM approach underperforms supervised baselines, while few-shot CoT obtains slightly worse or comparable results against supervised baselines. Notably, single-run CARP outperforms fine-tuned RoBERTa on all benchmarks. Using the WP voting strategy, CARP yields new SOTA performances on four out of the 5 datasets: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 vs. 93.3).

Results on low-resource settings
To simulate low-resource circumstances, we sample n = {16, 128, 256, 512, 1024} instances per class as low-resource setups. Experimental results are shown in Table 2. As can be seen, when the training set size is extremely small (i.e., 16 or 128 sentences), the performance of the supervised model is far below CARP. Even with only 16 examples to train on, the accuracy of CARP on SST-2 is already around 90%, whereas the supervised models' performance is close to random guessing. This demonstrates the strong generalization ability of CARP in the low-resource setup. As anticipated, kNN search becomes more effective as the amount of training data increases: enlarging the training dataset increases the chance that the chosen examples are relevant to the input, resulting in improved results. Specifically, using 16 examples per class, CARP achieves performance comparable to supervised models with 1,024 examples per class; using 512 annotated instances per class, CARP achieves performance comparable to supervised models trained on the full set.
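A minimal sketch of the per-class subsampling used to build the low-resource training sets; the data layout (a list of (text, label) pairs) and the helper name are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_per_class(train_set, n, seed=0):
    """Randomly keep n (text, label) examples per class from the training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in train_set:
        by_class[label].append((text, label))
    subset = []
    for label, examples in by_class.items():
        subset.extend(rng.sample(examples, min(n, len(examples))))
    return subset

# e.g., build the 16-shot-per-class setup
low_resource_train = sample_per_class([("great movie", "positive"),
                                        ("dull plot", "negative")], n=1)
```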

Domain Adaptation
It is unclear whether training models on the specific dataset for retrieving demonstrations is essential.In this subsection, we conduct an analysis on using demonstrations from out-of-distribution datasets.
We use SST-2 and Yelp; the task is to determine the positive or negative polarity of the given text. SST-2 and Yelp are from different domains: SST-2 consists of snippets from Rotten Tomatoes, whereas Yelp consists of product reviews from the online website. Experimental results are shown in Table 3. "SST-2 train & SST-2 test" means that demonstrations are from the SST-2 dataset and testing is performed on the SST-2 dataset; "Yelp train & SST-2 test" means demonstrations are from Yelp and testing is performed on the SST-2 dataset. We see a significant decrease in performance (-7.2%, 95.99% vs. 88.78%) when switching from SST-2 train to Yelp train with supervised RoBERTa, which illustrates that supervised models are very sensitive to out-of-distribution data. On the contrary, we only observe a slight decrease in performance (-0.5%, 96.80% vs. 96.29%) when switching from SST-2 train to Yelp train on the SST-2 test set, illustrating the greater capability of CARP in domain adaptation scenarios. This means CARP is very robust when training and test data are not from the same domain.

Impact of the number of demonstrations
We explore the effect of the number of demonstrations in prompts using SST-2. Results for CARP with different sampling strategies are shown in Figure 2. As can be seen, performance improves as the number of demonstrations increases, which is in line with our expectation.

Table 4: The effect of components on the SST-2 dataset with different strategies.

The effect of components in demonstrations
CARP uses (text, clues, reasons, gold label word) tuples as demonstrations. In this subsection, we examine the influence of each component in (text, clues, reasons, gold label word) by removing it from the prompts. Experimental results are shown in Table 4. As shown in Table 4, the text in demonstrations has the greatest impact, followed by the clues, reasons and label.

The effect of different types of label words
Label words denote the words generated by LLMs that indicate the label of the input. We explore the impact of using different kinds of label words:
• Position index: e.g., one, two, three, etc.
• Flipped words: words that are contrary to the original target meanings, e.g., "positive" to denote the negative polarity and "negative" to denote the positive polarity.
• Random words: words randomly chosen from the vocabulary.
• Special tokens: tokens that do not have semantic meaning; they are independent of the input and are added for a certain purpose, e.g., <cls>, <mask>.
Note that GPT-3 generates the same label words for the binary sentiment classification task.
Results are shown in Table 5. As can be seen, using the original annotation words as label words achieves the best performance. We also observe a significant performance decrease when flipped words are used as label words in demonstrations.

The effect of demonstration order
During experiments, we find that the ordering of demonstrations affects the final results. In this subsection, we further investigate the influence of demonstration order.
The demonstration orderings we investigate include: • Random: randomly shuffle the retrieved demonstrations.
• Low-to-High: demonstrations with lower similarity scores come first, so demonstrations with higher similarity scores are placed closer to the test sequence, which sits at the end of the prompt.
• High-to-Low: demonstrations with lower similarity scores are placed closer to the test sequence.
As shown in Table 6, performance is sensitive to the ordering of the demonstrations. The low-to-high ordering achieves the best performance compared to the random and high-to-low orderings.
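A small sketch of the three orderings, assuming each retrieved demonstration carries the similarity score returned by the kNN sampler; the function name and tuple layout are illustrative.

```python
import random

def order_demonstrations(demos, strategy="low-to-high"):
    """demos: list of (text, label, similarity) tuples from the kNN sampler.

    In 'low-to-high', the most similar examples end up closest to the test
    input, which is placed at the end of the prompt.
    """
    if strategy == "random":
        return random.sample(demos, len(demos))
    reverse = (strategy == "high-to-low")
    return sorted(demos, key=lambda d: d[2], reverse=reverse)
```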

Conclusion
In this paper, we introduce Clue And Reasoning Prompting (CARP) for the text classification task. CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks. More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups. In the future, we would like to explore CARP on more natural language understanding tasks.

Evaluating the Quality of Generated Reasoning Explanations
Models trained on natural language inference datasets can determine whether a given "hypothesis" logically follows from the "premise". However, lacking annotated data from the target domains, NLI-trained models cannot generalize across multiple domains (e.g., opinion, reviews, news). Hence, we use 16-shot ICL with GPT-3 to evaluate whether the generated rationale explanations can be entailed from the input text. If InstructGPT responds with "entailment", it indicates that the generated reasoning process is logically faithful to the text; otherwise, the reasoning process is not faithful to the text. We sample training instances from the SNLI dataset (Bowman et al., 2015) as demonstrations, and the prompt is shown as follows: Given the premise and hypothesis, please justify whether the HYPOTHESIS can be entailed from the PREMISE. Please return yes or no. PREMISE: <text> HYPOTHESIS: <reasoning-process> Evaluation results are shown in Table 9. As can be seen, the reliability percentages for SST-2 and R8 are higher than 95%. This indicates that it is feasible to use the model-generated reasoning process as part of the prompts to augment ICL performance. The perplexity of the generated reasoning text is smaller than 4, which indicates that the generated reasoning text is fluent. The logical faithfulness scores are larger than 93%, which is in line with our expectation that LLMs can generate reasonable explanations.

G.3 The influence of hyper-parameters
We investigate the effect of model hyper-parameters, including the temperature and the frequency penalty. We conduct experiments with InstructGPT-3 on the SST-2 dataset.
Temperature: the temperature τ controls the variety of the generated text when the other hyper-parameter top_p is set to 1. The higher τ is, the more variety is introduced. When τ is close to 0, the model generates the same result as the greedy decoding method.

Figure 1 :
Figure 1: Examples of CARP prompts under zero-shot and few-shot settings. Comparisons of different prompts can be found in Appendix H.
CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 vs. 93.3). More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups with orders of magnitude fewer training examples. Specifically, with 16 examples per class, CARP achieves performance comparable to supervised models trained on the full training set containing more than a thousand examples per class. This demonstrates the capabilities of CARP in real-world text classification cases where training data is limited.

Figure 3 :
Figure 3: Examples of zero-shot prompting methods for the text classification task: (a) the vanilla prompting method; (b) the Chain-of-Thought (CoT) (Kojima et al., 2022) prompting method; (c) the proposed CARP prompting method.

Table 1 :
Accuracy of different settings on the benchmarks. We report mean results over 5 runs. GPT-3 denotes text-davinci-003. In few-shot experiments, we sample 16 annotated examples (k=16) for every test instance. * indicates existing SOTA results. "WP Vote" denotes weighted probability vote.
This better mimics real-world situations where training data is limited. For the full-training setup, we follow the standard train/dev/test split. For the low-resource setup, we randomly sample n instances per class (n in {16, 128, 256, 512, 1024}) from the benchmark training set. The sampled subset forms a new training set used to test different models' abilities in low-resource situations. During experiments, we train models and sample demonstrations from the new training set.

Table 6 :
Accuracy scores on SST-2 when assembling demonstrations with different ranking strategies.

Table 7 :
Accuracy of different settings on the test subsets (results over 5 runs). GPT-3 denotes text-davinci-003. In few-shot experiments, we sample 16 annotated examples (k=16) per prompt. "MJ Vote" is short for majority vote; "WP Vote" denotes weighted probability vote.

The MR dataset contains reviews of films, labeled for whether the sentiment is positive or negative. The corpus has 10,662 reviews. We follow (Tang et al., 2015) and use the same train/test split.

Table 8 :
Label words and results on the SST-2 dataset with different strategies.

Table 9 :
Results for evaluating the quality of the generated reasoning explanations. We sample 500 (text, reason) instances each for SST-2 and R8.

Table 11 :
Dataset subsets.

To exploit the effect of the temperature τ, we set τ from 0 to 1.0. Experimental results are shown in Table G.3. We tokenize the response text with the GPT tokenizer and then count the number of tokens.