Tree Prompting: Efficient Task Adaptation without Fine-Tuning

Prompting language models (LMs) is the main interface for applying them to new tasks. However, for smaller LMs, prompting yields low accuracy compared to gradient-based fine-tuning. Tree Prompting is an approach to prompting that builds a decision tree of prompts, linking multiple LM calls together to solve a task. At inference time, each call to the LM is determined by efficiently routing the outcome of the previous call using the tree. Experiments on classification datasets show that Tree Prompting improves accuracy over competing methods and is competitive with fine-tuning. We also show that variants of Tree Prompting allow inspection of a model's decision-making process.


Introduction
Pretrained language models (LMs) have made remarkable progress in recent years (Vaswani et al., 2017; Brown et al., 2020; OpenAI, 2023), but their large size makes them difficult to fine-tune with gradients for specific downstream tasks. As such, prompting has become the main interface for applying pretrained LMs, where task-specific instructions are provided to guide an LM's behavior. The most common way to adapt LMs is to use few-shot in-context examples, where input-output pairs are shown to the model.
Yet, few-shot prompting has a clear downside. Prompt expressiveness is limited by the context length of the language model. This constraint prevents using more than a handful of examples for few-shot in-context learning, particularly in memory-constrained environments. If there is additional supervised data available for a task, users need to either ensemble together many prompts or back off to alternative LM fine-tuning approaches.
*Equal contribution. A scikit-learn-compatible API for Tree Prompting is available at github.com/csinva/treeprompt.

In this work, we propose Tree Prompting as an alternative method for incorporating task supervision. The key idea is to use training data to form a decision tree of simple prompt-LM calls, with each prompt determined by the outcomes of previous calls. The method does not change the parameters of the language model, but instead uses its outputs to determine an effective tree structure. To determine the prompts used at each node of the decision tree, we propose a simple bagging-inspired approach that samples few-shot examples to find the most informative prompt (see Fig. 1). To convert LM outputs into split features that determine the decision path, we consider both a pre-defined verbalizer (Hu et al., 2022) and a more expressive kNN Prompting approach (Xu et al., 2023). To learn the tree structure, we employ a classic decision tree learning algorithm (Breiman et al., 1984). The constructed tree is a sparse representation of the fine-tuning data: it incorporates a large number of few-shot examples but requires only a constant number of LM calls at inference.
Tree Prompting offers several advantages over existing prompting approaches. It allows users to easily incorporate large supervised training datasets without requiring longer contexts. It also allows experts to examine the decision-making process underlying a prediction in detail, which can be further improved by combining Tree Prompting with prompt generation methods. Finally, Tree Prompting can be adapted to be compatible with many existing LMs that are only accessible to the public via API calls. We demonstrate these advantages in experiments on multi-class classification benchmarks.

Background: Decision Trees
Decision trees are a classic model for classification and regression. They provide a graphical, intuitive model of decision-making, based on a cascading series of binary decisions. At each node in the tree, a decision is made based on a single feature of the input, which leads to the next node, and ultimately to a leaf node representing a prediction.
Learning  Decision trees are constructed greedily in a top-down manner, starting from the root node, where all training data (x, y) and features ϕ(x) ∈ {0, 1}^d are available. At each node, the feature that best splits the dataset into two subsets is chosen. The "best split" is determined by a criterion that measures the quality of a split; a commonly used criterion is the Gini impurity from the CART algorithm (Breiman et al., 1984). The selected feature creates two child nodes, each containing the subset of data that satisfies the respective split condition. This process is repeated recursively for each child node with the corresponding subset of the data until a stopping condition is met. Each leaf node in the final decision tree represents a decision (such as a class label prediction), determined by the majority class of the instances in the leaf.
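The greedy split selection described above can be sketched in a few lines of Python; `gini` and `best_split` are illustrative helper names, not part of any released implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(features, labels):
    """Pick the binary feature index whose split minimizes the
    weighted Gini impurity of the two child subsets.
    `features` is a list of 0/1 feature vectors, one per example."""
    n, d = len(features), len(features[0])
    best_i, best_score = None, float("inf")
    for i in range(d):
        left = [y for x, y in zip(features, labels) if x[i] == 0]
        right = [y for x, y in zip(features, labels) if x[i] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_i, best_score = i, score
    return best_i, best_score
```

The same criterion is then applied recursively to each child's subset until a stopping condition (e.g. a depth limit or a pure node) is reached.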
Inference  A decision tree makes predictions on unseen data by traversing the tree from the root node to a leaf node. Starting from the root, the feature value of the example corresponding to the split feature at the current node is used to determine whether the left child or the right child node is visited next. This process is repeated until a leaf node is reached. The class label associated with this leaf node is then used as the prediction.
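Inference amounts to a simple root-to-leaf walk. A minimal sketch, using a plain dict to represent the tree (this node layout is an assumption for illustration only):

```python
def predict(node, phi):
    """Traverse from the root to a leaf.  `node` is a dict tree:
    internal nodes have 'feature', 'left', and 'right'; leaves have
    'label'.  `phi` maps a feature index to its binary value for the
    current example (in Tree Prompting, one prompt-LM call per node)."""
    while "label" not in node:
        node = node["right"] if phi(node["feature"]) else node["left"]
    return node["label"]
```

Note that `phi` is only evaluated for the features along the visited path, which is what makes the traversal cheap at inference time.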

Tree Prompting
Tree Prompting utilizes decision trees as a method of adapting LMs to specific tasks without fine-tuning the model. Assuming access to a set of text-label pairs (x, y), the goal is to determine a tree that best classifies this data. The algorithm proceeds in a top-down manner: at each node, it selects the best prompt based on the chosen method for finding prompt candidates and constructing split features.
However, unlike standard decision trees, we do not have access to a predetermined set of features ϕ(x). Tree Prompting instead constructs this feature function dynamically by constructing prompts. The value of a feature ϕ_i(x) is determined by running a prompt through the LM and mapping its response to a binary value.
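A prompt-based feature ϕ_i(x) can be sketched as a closure around an LM call; `make_prompt_feature` and the `lm`/`verbalizer` callables are hypothetical stand-ins for a real model query and output mapping:

```python
def make_prompt_feature(prompt_template, lm, verbalizer):
    """Build a binary feature phi_i(x): format the prompt with the
    input x, query the LM, and map its response to {0, 1}.
    `lm` takes a prompt string and returns a completion;
    `verbalizer` maps that completion to a binary value."""
    def phi(x):
        response = lm(prompt_template.format(test_input=x))
        return verbalizer(response)
    return phi
```

Each candidate prompt yields one such feature, and the tree-learning algorithm chooses among them exactly as a standard decision tree chooses among pre-computed features.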
A major benefit of utilizing decision trees in this setting is their efficiency at inference time. Constructing the tree lets us compactly represent a large number of task-specific training examples. If each ϕ_i requires running one prompt, a tree of depth D represents up to 2^D features, yet at inference time we only need D prompt calls to classify a single datapoint.
Our primary approach to finding prompts for features ϕ_i(x) is to select random few-shot examples drawn from the training data, as shown in Figure 1. We take inspiration from bagging approaches (Breiman, 1996) that combine random training samples to produce complementary parallel models. By sampling random (x, y) pairs from the task training data and passing them to an LM, we are effectively bagging small training splits. Each prompt is constructed by alternating input examples with their corresponding labels in a templated form.
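The bagging-style prompt construction can be sketched as follows; the template wording and the helper name `make_fewshot_prompt` are illustrative assumptions, not the paper's exact format:

```python
import random

def make_fewshot_prompt(train_pairs, k=4, seed=0):
    """Sample k (text, label) pairs from the training data and format
    them as alternating input/label demonstrations, leaving a slot for
    the test input.  Different seeds yield different candidate prompts,
    mimicking bagged training splits."""
    rng = random.Random(seed)
    shots = rng.sample(train_pairs, k)
    lines = [f"Input: {x}\nOutput: {y}" for x, y in shots]
    return "\n\n".join(lines) + "\n\nInput: {test_input}\nOutput:"
```

Generating many such prompts with different seeds produces the pool of candidate split features from which the tree-building algorithm selects.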
Once a prompt is constructed, the feature value is set using a pre-defined verbalizer to transform the LM's output (Hu et al., 2022). A verbalizer is a function that maps the LM's output probabilities into a discrete decision. A simple implementation of a verbalizer is to determine whether the predicted probability for the token Yes or the token No is higher. In this work, we experiment with two different verbalizers: the first maps the logits to class labels (such as Positive/Negative for binary sentiment classification), and the second, more generic, verbalizer maps the logits to Yes/No.
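A minimal Yes/No verbalizer over output log-probabilities might look like this (the dict-based interface is an assumption for illustration):

```python
def verbalize(token_logprobs, positive="Yes", negative="No"):
    """Map LM output log-probabilities to a binary split feature:
    1 if the positive verbalizer token is more likely than the
    negative one, else 0.  `token_logprobs` maps token strings to
    log-probabilities; missing tokens are treated as -inf."""
    return int(token_logprobs.get(positive, float("-inf"))
               > token_logprobs.get(negative, float("-inf")))
```

The class-label variant is identical except that the compared tokens are the task's class names rather than Yes/No.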
When the output logits of the LM are inaccessible, we can discretize the LM's outputs into "yes"/"no" categories defined by the verbalizer using word matching. With few-shot prompts, large LMs have empirically been found to respect the template format and, most of the time, output only labels that they have seen in the demonstrations (Brown et al., 2020).

Figure 2: Extension to the Tree Prompting approach using instruction prompts on the Emotion dataset. Each path represents a decision sequence, and colors correspond to different emotions (classes). The line width indicates the number of instances within a particular class. As the decision process advances down the tree, the classes get separated.
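The word-matching fallback for logit-free (API-only) settings can be sketched as follows; matching on the earliest occurrence of a label word is one reasonable policy, not a prescribed one:

```python
def verbalize_by_matching(generated_text, label_words=("yes", "no")):
    """Discretize a raw LM completion (no logits available) by finding
    which verbalizer word appears first in the text.  Returns the index
    of that word in `label_words`, or None if no label word appears."""
    text = generated_text.lower()
    positions = {w: text.find(w) for w in label_words if w in text}
    if not positions:
        return None
    return label_words.index(min(positions, key=positions.get))
```

A `None` result (the model ignored the template) can be handled by, e.g., defaulting to the majority branch at that node.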

Extensions
Instruction Prompts  To leverage expert insights on specific tasks, we can instead use human-curated prompt candidates (Bach et al., 2022), as shown in Fig. 2. To diversify and enrich the pool of prompt candidates, we leverage the capabilities of GPT-3.5 to generate paraphrases of the original prompts. This method provides the ability to incorporate domain-specific knowledge, and the prompts are more interpretable compared to random few-shot examples. However, it might be less adaptable to novel or unique task specifications compared to the other automatic prompt candidate generation methods.
Dynamic Prompts  Instead of pre-constructing random prompt-based features, we can generate dynamic prompts while building the decision tree. At each node, we conduct a discrete prompt search to identify a list of prompt candidates that best explain the subset of data at this node. The prompt that best splits this subset into two further subsets is then selected. The prompt search algorithm used in this paper is iPrompt (Singh et al., 2023b), which employs an LM to generate potential prompt candidates, ranking them based on how well they explain the data. Dynamic prompts offer enhanced flexibility and adaptability, at the cost of additional computation.
kNN Prompting Features  As a more expressive alternative to predefined verbalizers, we consider the kNN Prompting approach (Xu et al., 2023). kNN Prompting employs the label of the nearest neighbor in an anchor set as the split feature, with the distance measured in terms of the KL divergence between output probabilities. This approach allows for the use of a large number of examples at each node, extending beyond the limitations imposed by the restricted context window size, making it more expressive. A downside of this approach is its dependence on access to the LM's output logits. Moreover, as multiple prompts are utilized at each node of the decision tree, this can compromise the interpretability of the model.

Tree Ensembles  Trees generated via Tree Prompting can be used to construct typical tree ensembles such as random forests (Breiman, 2001) or gradient-boosted trees (Freund et al., 1996) by using a Tree Prompting tree as the base estimator. This incurs very little overhead when using a fixed set of prompts, as the split features can be shared across all trees after being computed once.
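The kNN Prompting split feature can be sketched with the standard library; the toy distributions below stand in for real LM output probabilities over the label tokens:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as
    aligned lists of probabilities (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def knn_split_feature(query_dist, anchors):
    """Split feature from the label of the nearest anchor example,
    with distance measured as KL(query || anchor) between the LM's
    output distributions.  `anchors` is a list of
    (output_distribution, label) pairs from the anchor set."""
    return min(anchors, key=lambda a: kl(query_dist, a[0]))[1]
```

In the full method k may exceed 1 and the anchor distributions come from separate prompt-LM calls, but the nearest-neighbor lookup itself is this simple.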
Baselines  We compare our approach, TreePrompt (Tree Prompting), to standard fine-tuning.
In addition, we also compare against a conventional prompting baseline, FSPrompting, which directly uses few-shot example demonstrations as the prompt. We also compare the performance of our ensembling approach, TreePrompt Ens, with two baseline ensembling strategies: Greedy, which adds prompts to an ensemble in order of their cross-validation accuracy, and Boosting, which adds prompts to an ensemble using AdaBoost (Freund and Schapire, 1997; Hou et al., 2022; Pitis et al., 2023).

Classification Accuracy
Our main results are summarized in Table 1. The table compares Tree Prompting across multiple language model sizes to other few-shot prompting and ensembling approaches, as well as to gradient-based fine-tuning. Approaches are allotted a maximum of 40 LM inference calls per example.
Results show that Tree Prompting outperforms basic few-shot prompting and also ensembling-based approaches across model sizes and almost all datasets. The performance difference is particularly large across smaller model classes. For instance, while FSPrompting averages an accuracy of 44.3% with GPT-2 Small, Tree Prompting elevates this to 60.5%. Tree Prompting can also be ensembled, which produces accuracy improvements at the cost of more LM calls.
We also compare Tree Prompting to gradient-based fine-tuning, particularly on GPT-2 Large. Results show that Tree Prompting is less stable than fine-tuning, performing poorly on some tasks, but outperforms it on 5 of 10 tasks. This result shows that Tree Prompting can learn well from task supervision at the cost of additional runtime queries. (Tree Prompting could likely perform better compared to fine-tuning if we increased the maximum number of prompts beyond 40.) Relative to the baselines, Tree Prompting is typically able to outperform them all while making fewer queries than Greedy and Boosting. Tree Prompting makes large improvements over few-shot prompting in most cases, even when the model size is large. We observe a failure of the boosting strategy when the number of classes is large (specifically for 14-class DBPedia). Tree Prompting with gradient boosting generally gives an increase in performance at the cost of a 5.7-times increase in queries.

Inference Efficiency
As computing outputs from language models can be costly, particularly with large LMs, the efficiency of Tree Prompting at inference time is crucial. In Fig. 3, we plot test accuracy against the number of language model evaluations per example (#LM Calls) to gauge this efficiency. Tree Prompting frequently surpasses competing ensemble strategies in performance under the same number of LM calls, indicating an improvement in efficiency. This gain is more significant for multiclass datasets, such as Emotion and Financial PhraseBank (FPB).
To establish an upper bound of accuracy, we consider Tree Prompting ensembling.This approach generally achieves the best test performance across all methods, although it also demands more LM calls than a single tree (up to 40 calls).

Interpretability and Dynamic Prompts
A practical benefit of decision trees is increased interpretability. Each node of the decision tree can be inspected, offering insights into the decision-making process when predicting the label of a given input. Our few-shot approach for Tree Prompting is challenging to interpret, but we can instead use interpretable prompts that are human-curated or dynamically constructed. Fig. 2 demonstrates an instance of a decision tree learned from human-curated prompts on the Emotion dataset, where different colors represent the true labels of the data. At the root node, a high-level query is posed regarding whether the tweet's underlying emotion is love. Deeper in the tree, more granular questions are presented, e.g. whether the sentiment of the sentence is anger.
Dynamic prompts offer an additional advantage over human-curated prompts: they are capable of better reflecting the specific subset of data at each node, making the decision process more aligned with the data distribution at each tree node. Fig. 4 shows a tree learned using iPrompt to create dynamic prompts at each node of the tree. Prompts are suggested by GPT-4 and reranked according to the iPrompt algorithm, with the verbalizer Yes/No corresponding to the positive/negative classes of the MR dataset.

Comparison with kNN Prompting
Nonparametric methods like kNN Prompting (Xu et al., 2023) can be employed to improve model expressivity, which allows using multiple prompts per node and avoids the reliance on pre-defined verbalizers. Table 2 provides a comparison between Tree Prompting and kNN Prompting. In this comparison, Tree Prompting uses kNN Prompting predictions as split features. The results show that Tree Prompting outperforms vanilla kNN Prompting on most datasets, potentially due to its added flexibility of partitioning the input space using the decision tree, although it underperforms on three of the smaller datasets, including CB.

Comparison to Larger LMs
Tree Prompting allows enhancing the performance of small LMs to match the performance of large LMs, as shown in Table 3. For these experiments, instead of few-shot prompting we use instruction prompts curated from PromptSource (Bach et al., 2022). In this setting, even GPT-2 Small paired with Tree Prompting Ensemble is competitive against GPT-3, outperforming it on two datasets (FPB and MR), albeit being slightly worse on the other two datasets (IMDB and SST2). With the larger LM GPT-J, Tree Prompting outperforms GPT-3 with conventional prompting across all datasets, demonstrating the potential of repeatedly using a smaller model in a decision tree to outperform a larger model, which might be useful in resource-constrained scenarios.

Table 5: Comparative results using different prompt sources. We use GPT-2 Small with class names as the verbalizer, limiting the LM to a maximum of 5 average calls during inference.

Verbalizer Sensitivity
Table 4 shows the robustness of different approaches when employing a generic Yes/No verbalizer versus a class-name verbalizer. The results show that Tree Prompting consistently outperforms the baseline regardless of the verbalizer used, delivering decent performance even when using the generic Yes/No verbalizer. This feature could be useful in applications where class names are not meaningful words, such as in distinguishing between texts generated by different decoding settings (Naseh et al., 2023). Table 8 in Appendix A.3 shows full performance sensitivity results across different settings for the underlying LM, verbalizer, and source of prompts.

Prompt Source Sensitivity
Table 5 examines the sensitivity of various approaches to the source of prompt candidates. The comparison between using instruction prompts and few-shot prompts demonstrates that Tree Prompting consistently outperforms baselines regardless of the source of prompt candidates. It is worth noting that instruction prompts generally result in better performance than few-shot prompts, corroborating previous findings that in-context learning with a single prompt can work as well as multiple data demonstrations (Le Scao and Rush, 2021). However, curating instruction prompts requires extra human effort, since new prompts must be written for each new dataset.

Related Work
Prompting Language Models  The rise of large language models (LMs) has led to a surge in the development of effective prompting methods (Strobelt et al., 2022; Lu et al., 2022; Bach et al., 2022; Logan IV et al., 2022; Zhong et al., 2022; Singh et al., 2023b). Building on top of these methods, ensembling techniques that average multiple LM calls have often been shown to improve performance (Jiang et al., 2020; Zhang et al., 2023a), e.g. boosting (Hou et al., 2022; Pitis et al., 2023). Chain prompting (Wang et al., 2022; Press et al., 2022; Chase, 2023; Rush, 2023) is a widely used method that divides complex tasks into manageable subtasks, linking these via prompt-LM calls. This approach has proven effective across various applications, aligning with the intuition underlying this work: an LM can handle individual steps of a task more accurately than executing the task in full (Ma et al., 2023; Madaan et al., 2023; Zhang et al., 2023b). However, while chain prompting links prompt-LM calls, our approach organizes them within a decision tree, learning the tree structure and selecting appropriate prompts for each node.
FrugalGPT (Chen et al., 2023) also bears relevance to our work, proposing a cascade of LMs that stops when an intermediate output is considered reliable, resulting in better computational efficiency. Viewed from the perspective of decision trees, this approach resembles a right-branching decision tree.
Concurrent to our work, Tree of Thoughts (Yao et al., 2023; Long, 2023) organizes LM-generated "thoughts" within a tree structure for solution search. While we also use a tree structure, our aim is to partition the input space to simplify the LM's tasks at lower tree levels. We search the tree's structure and the prompt at each node during training, while keeping these elements static during inference. In contrast, Tree of Thoughts adjusts node prompts dynamically based on upper-level results, which sets it apart from our approach, where prompts remain constant post-training. Collectively, these works demonstrate the growing interest in merging tree structures with LMs for task decomposition, albeit with varied focuses and methodologies.
Decision Tree Applications  Dating back decades, decision trees have been a prevalent choice for classification and regression problems (Costa and Pedreira, 2022). In the field of natural language processing, decision trees and their ensemble variants such as Random Forest (Breiman, 2001), gradient-boosted trees (Freund et al., 1996), XGBoost (Chen and Guestrin, 2016), and BART (Chipman et al., 2010) have found use in areas like part-of-speech tagging (Magerman, 1995), syntactic parsing (Collins, 1997), and text classification (Sebastiani, 2002; Singh et al., 2023a). However, these studies predominantly utilize pre-defined textual features within their decision tree frameworks, contrasting our approach, where the decision tree is used to direct the language model's behavior.
Decision Trees for Interpretability  Decision trees have also been applied to increase the interpretability of neural models. For example, Wan et al. (2021) used a decision tree structure where each node is a neural classifier for image classification. Zhang and Zhu (2019) learned a decision tree to explain the decisions made by an image classifier post hoc. While these works primarily target vision-based applications, we adopt a similar strategy for natural language processing, where each node in our decision tree embodies a distinct prompt-LM call. Furthermore, our dynamic prompt setting enables the concurrent learning of prompts and the decision tree structure, distinguishing our method from conventional decision tree applications that function within a pre-defined feature space.

Conclusions and Future Work
We introduce the Tree Prompting approach, a use of decision trees for task adaptation. Experiments demonstrate that Tree Prompting can offer improved performance across various text classification tasks while still remaining efficient during inference. On many tasks, the model is competitive with gradient fine-tuning. Additionally, the approach can be used with dynamic prompt creation to yield interpretable models.
Our results suggest a future direction of exploring a flexible and modularized assembly of models. One exciting direction is to extend Tree Prompting to generalize to tasks beyond text classification, using previous outputs to guide subsequent prompts and LMs. Further exploration could involve extending Tree Prompting to jump across nodes in the tree (similar to Long (2023)) or introduce cycles in the tree (similar to Besta et al. (2023)), ultimately developing a program of prompts by navigating various nodes in a decision tree as though calling different functions. Another direction could explore incorporating different criteria into the tree-building algorithm, e.g. fairness (Jo et al., 2022), sparsity (Hu et al., 2019; Tan et al., 2022), or smoothness (Agarwal et al., 2022).

Limitations
Sample Complexity  While Tree Prompting's adaptability and flexibility are its strengths, they also contribute to its higher sample complexity. As shown in Sec. 6.3, Tree Prompting lags behind few-shot prompting in low-data environments. Decision trees inherently risk overfitting, particularly when dealing with numerous features. This shortcoming can be partially offset through the use of larger training sets and by restricting the tree's size in relation to the training set size.
Training Cost  Although Tree Prompting demands fewer LM calls during inference compared to analogous techniques, its training process, which involves learning the decision tree, requires computing prompt features for every example in the associated data subset at each node. This can be resource-intensive for large LMs. Additionally, when paired with dynamic prompts that leverage automatic prompting methods (which are typically computation-heavy), the training process can be substantially expensive, as each node necessitates running the autoprompting method once.
Interpretability  While decision trees are typically celebrated for their interpretability, the interpretability of Tree Prompting is bounded by the nature of the prompts and the verbalizer. Specifically, when employing a pre-defined prompt, its interpretability may not be as intuitive as that of dynamic prompts. If the prompt itself (such as when using few-shot demonstrations) lacks interpretability, the entire decision tree's interpretability is likely to be compromised.

Figure 1 :
Figure 1: Illustration of Tree Prompting. At each node of the decision tree, a subset of training data is used to prompt the LM to partition the input space into subregions. This process is repeated until a classification decision is made at a leaf node.

Figure 3 :
Figure 3: Performance as a function of the number of LM evaluations per example (#LM calls). We use GPT-J as the base LM and class names as verbalizers. GBDT, a gradient-boosted tree using Tree Prompting as the base classifier and fitted with a maximum of 40 LM calls, provides an upper bound on the accuracy achievable on individual datasets.

Figure 4 :
Figure 4: Tree Prompting tree learned using dynamic prompts on the MR dataset. We use GPT-4 for prompt generation in AutoPrompting and GPT-J as the base LM in Tree Prompting.

Fig. 5
Fig. 5 visualizes the performance of Tree Prompting in relation to the fraction of training samples used for training.When compared to baseline ensembling techniques, Tree Prompting sometimes underperforms in low-data regimes (on FPB, IMDB, and MR), but it eventually outperforms baselines as more training data is available.

Figure 5 :
Figure 5: Accuracy plotted against the fraction of samples used for training. The performance improvement facilitated by Tree Prompting (shown here with gradient boosting) becomes more noticeable as the number of training samples grows. We use GPT-2 with instruction prompts and a set of 10 prompts for this visualization.

Figure 6 :
Figure 6: Example tree for the MR dataset. We use GPT-J and search for 10 instruction prompts. Class names (positive/negative) are used as the verbalizer.

Table 1 :
Main results. ICL: in-context learning. ICL Prompting and ICL Prompting Ensemble use 128 examples per class to construct the prompt.

Table 2 :
Comparison between Tree Prompting and kNN Prompting. Both approaches use GPT-2 Large as the base LM. Tree Prompting uses predictions from kNN Prompting to construct split features. Tree Prompting results are averaged over 5 random seeds.

Table 3 :
Tree Prompting with supervision achieves comparable accuracy to GPT-3 zero-shot and supervised autoprompting. Tree Prompting uses instruction prompts, class names as the verbalizer, and fits gradient-boosted trees with up to 40 prompts. Averaged over 3 random seeds.

Table 4 :
Accuracy with different verbalizers. We employ GPT-2 Small as the LM, limiting to a maximum of 5 average calls during inference.

Table 6 :
Dataset statistics, with one example input and its label per dataset.
- AGNews: "Bears Claw Back Into the Black (Reuters). Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again." → business
- CB: Premise: "Do you mind if I use your phone?" Ronni could see that Guido's brain was whirring. Hypothesis: Guido's brain was whirring. → entailment
- CR: "i didn't have any major problems installing this software." → positive
- DBPedia: "Geoffrey D. Falksen (born July 31 1982) is an American steampunk writer." → artist
- MR: "the film is flat." → negative
- RTE: Sentence 1: "No Weapons of Mass Destruction Found in Iraq Yet." Sentence 2: "Weapons of Mass Destruction Found in Iraq." → not_entailment
- TREC: "What's known as The queen of Drinks?" → entity
- FPB: "According to Gran, the company has no plans to move all production to Russia, although that is where the company is growing." → neutral
- IMDB: "would put this at the top of my list of films in the category of unwatchable trash! [...]" → negative
- Emotion: "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake" → sadness