SEER: A Knapsack approach to Exemplar Selection for In-Context HybridQA

Question answering over hybrid contexts is a complex task, which requires the combination of information extracted from unstructured texts and structured tables in various ways. Recently, In-Context Learning has demonstrated significant performance advances for reasoning tasks. In this paradigm, a large language model performs predictions based on a small set of supporting exemplars. The performance of In-Context Learning depends heavily on the selection procedure of the supporting exemplars, particularly in the case of HybridQA, where considering the diversity of reasoning chains and the large size of the hybrid contexts becomes crucial. In this work, we present Selection of ExEmplars for hybrid Reasoning (SEER), a novel method for selecting a set of exemplars that is both representative and diverse. The key novelty of SEER is that it formulates exemplar selection as a Knapsack Integer Linear Program. The Knapsack framework provides the flexibility to incorporate diversity constraints that prioritize exemplars with desirable attributes, and capacity constraints that ensure that the prompt size respects the provided capacity budgets. The effectiveness of SEER is demonstrated on FinQA and TAT-QA, two real-world benchmarks for HybridQA, where it outperforms previous exemplar selection methods.


Introduction
Hybrid documents, which combine tables and text paragraphs, are prevalent in various industries such as finance, healthcare, and manufacturing. The development of question-answering systems capable of effectively handling these documents holds the potential to automate business processes and enhance accessibility to the information they contain. Several benchmarks anchored in the financial domain have been introduced for hybrid question answering (HybridQA) (Chen et al., 2021c; Zhu et al., 2021; Zhao et al., 2022). Figure 1 presents an example from the FinQA dataset. Despite ongoing progress, current HybridQA models have yet to achieve human expert performance (Zhu et al., 2021; Chen et al., 2021c).

Recently, In-Context Learning (ICL) with Large Language Models (LLMs) has shown great performance on reasoning tasks. In this setting, a set of exemplars, i.e., training instances with their answers, is provided as part of the input prompt to assist LLMs in generating the correct answer. ICL is an inference-time technique that keeps the LLM's parameters frozen (Brown et al., 2020). The performance of ICL depends heavily on the quality of the provided exemplar set (Liu et al., 2022a). Various strategies, ranging from random selection to similarity-based retrieval, have been proposed to tackle this problem (Liu et al., 2022b; Rubin et al., 2022; Li and Qiu, 2023; Lu et al., 2023). When selecting exemplars for HybridQA, special consideration must be given to the unique challenges of the task, including diverse reasoning chains, large context sizes (Chen et al., 2021c; Zhao et al., 2022), and the limited correlation between the question and its reasoning chain.

In this work, we propose Knapsack Programs as a framework to model exemplar selection for ICL. Knapsacks are a family of Integer Linear Programs that search for an optimal subset of items under linear constraints (Wolsey, 2020). For a given test instance, a Knapsack Program is solved to obtain the optimal exemplar set. This expressive framework allows balancing the diversity and similarity of the selected exemplars while controlling the prompt size with user-defined linear constraints. We introduce SEER, a novel method to select exemplars for HybridQA using Knapsack Programs. SEER reduces the candidate set with a nearest neighbor filtering, and leverages constraint modules to predict the attributes of the test instance. The attributes of a HybridQA instance are properties that influence the underlying reasoning chain, e.g., the modality (table, text, hybrid) and the answer type (span extraction, arithmetic reasoning, counting). By leveraging constraint modules, SEER shapes the Knapsack structure to prioritize the selection of exemplars that share similar attributes with the test instance.

The contributions of this work are as follows: (1) we introduce Knapsack Programs as a framework for ICL exemplar selection; (2) we propose SEER, a novel exemplar selection method for In-Context HybridQA; (3) we address all three challenges of HybridQA exemplar selection at the same time with fine-grained token budget constraints and attribute-guided selection. Extensive evaluation on two real-world HybridQA benchmarks shows that SEER outperforms state-of-the-art exemplar selection methods, especially under restricted token capacity budgets.

Related work
In-Context Learning. LLMs have shown the ability to perform a wide range of tasks with only a few exemplars provided as a prompt while keeping all the model parameters frozen (Brown et al., 2020). However, performance is highly dependent on the quality of the provided exemplars (Zhao et al., 2021; Lu et al., 2022). Hence, several approaches have been explored for exemplar selection, including nearest neighbor search (Liu et al., 2022a), reinforcement learning (Zhang et al., 2022; Lu et al., 2023), clustering (Zhang et al., 2023), search algorithms (Li and Qiu, 2023), and supervised learning (Rubin et al., 2022; Ye et al., 2023). Rubin et al. (2022) consider token capacity indirectly by constructing the largest possible prompt with the selected exemplars. In contrast to previous methods, we cast exemplar selection as an ILP Knapsack Program, which allows us to specify diversity-enhancing constraints, optimize the token capacity directly without simple heuristics, and leverage powerful solvers to find a performant exemplar selection.

Hybrid Question Answering. Chen et al. (2020, 2021b) introduced the task of HybridQA on open-domain Wikipedia pages. Later on, datasets based on real-world financial documents were introduced (Zhu et al., 2021; Chen et al., 2021c; Zhao et al., 2022; Chen et al., 2022b). Previous work has focused on improving the retriever-generator framework (Lei et al., 2022; Sun et al., 2022; Zhang and Moshfeghi, 2022). Chen et al. (2022a) use few-shot chain-of-thought prompting to solve the task of HybridQA. However, they focus on improving the prompt format, while our work focuses on selecting good exemplars.

Integer Linear Programming for NLP. Integer Linear Programming has been used in many NLP tasks (Martins, 2014), including coreference resolution (Denis and Baldridge, 2007; De Belder and Moens, 2012), sentence compression (De Belder and Moens, 2010), dependency parsing (Riedel and Clarke, 2006), semantic role labeling (Roth and Yih, 2005), and translation (Germann et al., 2004). In this work, we introduce a novel application of ILP in NLP: exemplar selection for ICL. Furthermore, to the best of our knowledge, this is the first time the Knapsack family of ILP programs is used in NLP.

Integer Linear Programming
Linear Programming (LP) involves maximizing or minimizing an objective function while adhering to a set of constraints. The objective function is a weighted linear combination of variables, while the constraints are (in)equalities over linear combinations of these variables. These constraints restrict the value ranges of the variables and capture their interaction effects (De Belder and Moens, 2010). Integer Linear Programming (ILP) is a subset of LP wherein variables are constrained to take only integer values. ILP is divided into program families, one of which is the Knapsack. Given a set of items, the Knapsack's objective is to select the subset that maximizes the total value while remaining within the maximum capacity (Wolsey, 2020). More formally, the problem can be expressed as:

$$\max \sum_{i \in S} w_i x_i \quad \text{s.t.} \quad \sum_{i \in S} c_i x_i \leq C, \quad x_i \in \{0, 1\},$$

where $x_i$ is a variable that takes value 1 when item $i$ in set $S$ is selected and 0 otherwise, $w_i$ and $c_i$ are parameters representing the value and cost of item $i$, respectively, and $C$ is the maximum capacity of the Knapsack. Depending on the setting, additional constraints and variables can be added to this basic Knapsack program template.
The Knapsack problem is NP-hard, but several algorithms have been developed to find an optimal solution efficiently, including Branch-and-Bound and Cutting Planes (De Belder and Moens, 2010). Several solvers provide efficient implementations of these algorithms (Martins, 2014).
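As an illustration, the basic Knapsack template above translates directly into a few lines for such a solver. The sketch below uses the gurobipy Python API with toy values, costs, and capacity (illustrative only; any ILP solver would do):

```python
import gurobipy as gp
from gurobipy import GRB

# Toy data (illustrative only): item values w_i, costs c_i, capacity C.
w = [10, 7, 4, 9]
c = [5, 4, 2, 6]
C = 10

m = gp.Model("knapsack")
# x_i = 1 if item i is selected, 0 otherwise.
x = m.addVars(len(w), vtype=GRB.BINARY, name="x")
# Maximize the total value of the selected items...
m.setObjective(gp.quicksum(w[i] * x[i] for i in range(len(w))), GRB.MAXIMIZE)
# ...while staying within the capacity budget.
m.addConstr(gp.quicksum(c[i] * x[i] for i in range(len(w))) <= C, name="capacity")
m.optimize()

print([i for i in range(len(w)) if x[i].X > 0.5])  # indices of selected items
```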

Challenges of Exemplar Selection for HybridQA
When solving HybridQA problems with ICL, the task is to predict an answer $A$ given a question $Q$, a hybrid context consisting of text paragraphs $P$ and a table $T$, and a set of $n$ exemplars $E = \{e_1, \ldots, e_n\}$, where $e_i$ is a tuple $(Q_i, A_i, P_i, T_i)$:
$$A = \operatorname*{argmax}_a P(a \mid Q, P, T, E)$$

Prior studies (Liu et al., 2022a; Lu et al., 2023) have demonstrated that the thoughtful selection of the exemplar set $E$ can improve and stabilize the performance of ICL over a random selection baseline. However, selecting the optimal exemplar set poses three challenges for HybridQA problems. First, there is a high diversity in the type of questions and in the approaches required to solve them. The financial dataset FinQA, for example, contains more than 300 different numerical formulas; for questions asking to compute a "percentage change" alone, the training set contains 12 unique formulas. Given this diversity, it is not possible to cover all possible reasoning chains with a single set of exemplars. This challenge is partially addressed by similarity-based exemplar selection methods (Liu et al., 2022a). Second, Figure 2 illustrates the additional challenge of the low correlation between a problem's question and its attributes. This can result in prediction errors: problems may seem semantically similar while requiring different modalities and answer types. Third, HybridQA problems, especially those dealing with real-world data like financial documents, involve large contexts. LLMs are limited in the number of input and output tokens they can process, whether due to organizational resource constraints or the inherent limitations of the LLM. Consequently, it becomes crucial to ensure that the tokenized exemplars fit within the LLM's capacity while reserving enough tokens for generating the desired output. In the following, we propose a new exemplar selection method that addresses these three challenges by modeling them as objectives and constraints of a Knapsack program. Notably, this work addresses the latter two challenges explicitly for the first time.
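To make the first challenge concrete, the snippet below shows what a Python-style answer to one hypothetical "percentage change" question could look like; the values and variable names are illustrative, and the actual answer format is described in the Datasets section.

```python
# Question: what was the percentage change in revenue between 2018 and 2019?
revenue_2018 = 100.0  # illustrative operand extracted from the table
revenue_2019 = 110.0  # illustrative operand extracted from the table
change = revenue_2019 - revenue_2018  # one operation per line
answer = change / revenue_2018        # -> 0.10, i.e., a 10% increase
```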

SEER
SEER generates Knapsack programs for exemplar selection in HybridQA. Given the training set and a test instance, SEER constructs a unique Knapsack program using nearest neighbor filtering and constraint modules. The selected exemplar set is the optimal solution to the Knapsack program. These exemplars, along with the test instance, are provided as a prompt to an LLM for predicting the final answer. Figure 3 presents an overview of SEER's methodology.

Similarity computation involves calculating the cosine similarity between pairs of HybridQA problems' questions in the embedding space. The resulting similarity values serve as coefficients in the objective function. To ensure accurate comparisons, preprocessing is applied to remove noise: (1) all questions are lowercased; (2) all punctuation is removed; (3) dates, numerical values, locations, and companies are replaced by their NER tags.

Nearest neighbor filtering applies an initial filter (Liu et al., 2022a) to identify the k candidates from the training set that exhibit substantial surface similarity with the test instance, thus narrowing down the search space.

Constraint modules predict an attribute of the test instance. We define an attribute as a characteristic of the reasoning chain of a HybridQA problem. Attributes include the modality, the answer type, and the number of reasoning steps. Inferring these attributes is a standard classification task, where the question and hybrid context are provided as input and the output corresponds to one of the attribute values. The task can be addressed through fine-tuning or ICL.

A SEER Knapsack is uniquely defined for a test instance by the combination of the similarity weights and the predicted attributes. The candidate set $S$ is the subset of the original training set obtained by nearest neighbor filtering. The Knapsack has a double capacity constraint: $M$ is the maximum allowed number of exemplars, and $L$ is the maximum combined length of the exemplars $l_i$ in number of tokens; the value of $L$ depends on the backbone LLM. The program further contains two or more diversity constraints, whose structure depends on the attributes predicted by the constraint modules. As an illustration, the base template for Knapsack programs with predicted attribute "modality: table" is formulated as follows, where variable $x_i$ takes value 1 if instance $i$ from the candidate set $S$ is selected as an exemplar and 0 otherwise:

$$\max \sum_{i \in S} w_i x_i$$
$$\text{s.t.} \quad \sum_{i \in S} x_i \leq M, \qquad \sum_{i \in S} l_i x_i \leq L,$$
$$\sum_{i \in S} \text{table}_i\, x_i \geq \alpha M, \qquad \sum_{i \in S} \text{other}_i\, x_i \geq \beta M, \quad \text{where } \text{other}_i = \text{text}_i + \text{hybrid}_i,$$
$$x_i \in \{0, 1\}.$$

The diversity constraint configurations for the other attribute values are listed in Appendix B.
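As a minimal sketch, the base template above can be instantiated with the GUROBI Python API as follows. The function signature and the treatment of α and β as fractions of M are our assumptions; the paper's actual implementation may differ.

```python
import gurobipy as gp
from gurobipy import GRB

def build_table_knapsack(w, l, table, text, hybrid, M, L, alpha, beta):
    """SEER base template for predicted attribute "modality: table".

    w      -- similarity weight of each candidate exemplar (objective coefficients)
    l      -- token length l_i of each candidate exemplar
    table, text, hybrid -- 0/1 modality indicators per candidate
    M, L   -- capacity budgets (max exemplars, max total tokens)
    alpha, beta -- diversity parameters, assumed here to be fractions of M
    """
    n = len(w)
    m = gp.Model("seer")
    x = m.addVars(n, vtype=GRB.BINARY, name="x")
    # Objective: total similarity of the selected exemplars.
    m.setObjective(gp.quicksum(w[i] * x[i] for i in range(n)), GRB.MAXIMIZE)
    # Double capacity constraint: number of exemplars and prompt tokens.
    m.addConstr(x.sum() <= M, name="max_exemplars")
    m.addConstr(gp.quicksum(l[i] * x[i] for i in range(n)) <= L, name="max_tokens")
    # Diversity constraints: enough exemplars with the predicted attribute,
    # plus a minimum of exemplars with the other attribute values.
    m.addConstr(gp.quicksum(table[i] * x[i] for i in range(n)) >= alpha * M)
    m.addConstr(gp.quicksum((text[i] + hybrid[i]) * x[i] for i in range(n)) >= beta * M)
    return m, x
```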

LLM Code Generation
The ICL exemplars selected by SEER and the test instance are concatenated and provided as a prompt to an LLM. To remove text paragraphs from the hybrid context that are irrelevant to the question, we employ a pre-trained text retriever. Recent studies have demonstrated the benefits of formulating answer derivation as Python code (Chen et al., 2022a; Gao et al., 2023; Mishra et al., 2022). Inspired by these approaches, we adopt a code generation formulation instead of text generation. The resulting code is executed by an external engine to derive the answer. Figure 1 depicts the conversion from the original text answer to a Python code answer.
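A minimal sketch of this inference step, assuming a generic `llm` completion function and that the generated code stores its result in a variable named `ans` (both are assumptions, not the paper's actual interface):

```python
def answer_with_code(llm, exemplars, context, question):
    # Concatenate the selected ICL exemplars and the test instance into one prompt.
    prompt = "\n\n".join(exemplars) + "\n\n" + context + "\nQuestion: " + question
    code = llm(prompt)  # hypothetical LLM call returning a Python snippet
    # Execute the generated code with an external engine to derive the answer.
    namespace = {}
    exec(code, namespace)  # in practice this should run in a sandbox
    return namespace.get("ans")  # assumed convention for the answer variable
```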

Datasets
We evaluate SEER on two HybridQA datasets with real-world contexts: FinQA (Chen et al., 2021c) and TAT-QA (Zhu et al., 2021), both anchored in the financial domain.
FinQA comprises 8,281 problems, each containing a context with a table and multiple text paragraphs. The answers in FinQA are expressed in a domain-specific language as a sequence of operators with two operands each. These expressions are translated to Python code using an automated script, with each operation on a separate line. Tables are linearized with " | " as column delimiter and " \n " as row delimiter. A prompt example is provided in Figure 1. This dataset has one constraint module, for predicting the modality.

In TAT-QA, equations are written as a one-line variable assignment; a prompt example is provided in Appendix C. Unlike FinQA, which only contains arithmetic problems, TAT-QA includes other answer types such as (multi-)span extraction and counting problems. Consequently, TAT-QA has two constraint modules, one for the modality and one for the answer type.

The evaluation of the datasets is based on their respective metrics: execution accuracy (EA) and program accuracy (PA) (Chen et al., 2021c) for FinQA, and Exact Match (EM) and numeracy-focused F1 (Dua et al., 2019) for TAT-QA. EA and EM assess the correctness of the final answer, while PA ensures that the generated code is mathematically equivalent to the ground truth derivation. The numeracy-focused F1 computes the F1 score over the bag-of-words representation of the generated answer. For TAT-QA, we exclude the sub-task of scale prediction and reserve it for future work on multi-task ICL HybridQA. The LLM has little exposure to syntactic conventions unrelated to correct reasoning, such as determining the appropriate scale to represent a percentage ([0,1] or [0,100]). To take this into account, we allow some flexibility in the answer evaluation script; for further details on the evaluation procedure, refer to Appendix E.
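For illustration, the table linearization described above can be sketched as follows (a minimal version; the actual preprocessing script may handle more edge cases):

```python
def linearize_table(table):
    # Join cells with " | " (column delimiter) and rows with " \n " (row delimiter).
    return " \n ".join(" | ".join(str(cell) for cell in row) for row in table)

# linearize_table([["year", "revenue"], ["2019", "1,024"]])
# -> 'year | revenue \n 2019 | 1,024'
```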

Baselines
We compare SEER with the following exemplar selection strategies for ICL. The Random baseline randomly selects a set of exemplars from the training set for each test instance. The Fixed set baseline (Lu et al., 2023; Liu et al., 2022a) employs the same set of exemplars for every test instance: we randomly sample 20 groups of 4 and 8 exemplars from the training set for FinQA and TAT-QA, respectively, and select the group exhibiting the highest performance on the dev set. KATE (Liu et al., 2022a) selects the k nearest neighbors to the test instance; it corresponds to SEER without token capacity and diversity constraints. Diverse KATE, an extension of KATE, divides the training set into two subsets, one for text problems and one for table problems, and selects an equal number of nearest neighbors from each subset, ensuring that both modalities are present in the exemplar set. PromptPG (Lu et al., 2023) is a reinforcement learning method that trains a policy network to select well-performing exemplars from a fixed candidate pool; at inference time, the policy decides which exemplars from the candidate set are selected for a given test instance. For training, we use the same parameters as Lu et al. (2023), except the size of the candidate exemplar set, which is set to 20 and 40 for FinQA and TAT-QA, respectively. Finally, CSP (Constraint Satisfaction Problem) is SEER without the objective function: a candidate set is randomly selected among those that meet all the Knapsack's constraints.

To disentangle exemplar selection performance from errors related to the constraint modules, we also include results obtained using ground truth problem attributes, denoted SEER gold, which serves as an upper-bound estimate of SEER's performance.
Although our primary focus is on ICL, numerous studies have explored fine-tuning approaches for HybridQA. We report the results of the SOTA fine-tuning approach; for a comparison of fine-tuning approaches with SEER, refer to Appendix F.

Implementation details
We use CODEX, a 175-billion-parameter LLM available through the OpenAI API (Chen et al., 2021a), as the backbone for SEER and all baselines. CODEX has demonstrated exceptional performance on code generation tasks (Chen et al., 2021a). For the computation of question and text paragraph embeddings, we employ Sentence-BERT models (Reimers and Gurevych, 2019). The instruction for answer type prediction is: "Does this question require to extract spans from the document, to count, or to perform an arithmetic reasoning? Answer by one of the following: span, multi-span, count, arithmetic.". We set the maximum token length L to the maximum capacity of the LLM minus the token lengths of the problem's context and of the longest answer in the training set. This conservative estimate of L ensures that the exemplars, the problem's context, and its answer all fit within the imposed token limit. We use the GUROBI solver to find the optimal Knapsack solution. The detailed parameter values are listed in Appendix D. Our experiments are run on a single NVIDIA GeForce RTX 3050 GPU, except for CODEX inferences, which are performed on the OpenAI servers.
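The conservative token budget L described above boils down to the following computation; `count_tokens` stands for a tokenizer helper of the backbone LLM (an assumption for illustration):

```python
def token_budget(model_capacity, test_context, train_answers, count_tokens):
    # L = LLM capacity - tokens of the test problem's context
    #     - tokens of the longest answer in the training set.
    longest_answer = max(count_tokens(a) for a in train_answers)
    return model_capacity - count_tokens(test_context) - longest_answer
```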

Main results
Table 1 shows the main results, showcasing the superior performance of SEER compared to all exemplar selection baselines. Notably, SEER gold achieves better results than SEER by a margin of 0.4 to 1.5%, indicating that accurate attribute prediction by the constraint modules contributes significantly to the overall performance. We conclude that SEER's edge over the baselines is a direct consequence of predicting and selecting the correct attributes. Interestingly, the baselines struggle with instances where incorrect attributes are predicted and selected, indicating a correlation between the difficulty of a HybridQA instance and the challenge of predicting its attributes. When no correct attribute is selected (IAS), the Fixed set baseline outperforms SEER and KATE by a slight margin.

Constraint modules results
Table 2 shows the performance of fine-tuned and ICL constraint modules on the dev and test sets.
On two out of the three attribute prediction tasks, the fine-tuned BERT module performs best.However, the ICL constraint module's performance is always close, making it a viable alternative when attribute labels are not available in the training set.
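As a sketch, the ICL variant of the answer type module can be implemented by prepending the instruction quoted in the implementation details to a few labeled exemplars; the `llm` call, prompt layout, and fallback label are assumptions for illustration.

```python
INSTRUCTION = ("Does this question require to extract spans from the document, "
               "to count, or to perform an arithmetic reasoning? Answer by one "
               "of the following: span, multi-span, count, arithmetic.")

def predict_answer_type(llm, labeled_exemplars, question):
    # labeled_exemplars: list of (question, answer_type) pairs from the training set.
    shots = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in labeled_exemplars)
    prompt = f"{INSTRUCTION}\n\n{shots}\nQuestion: {question}\nAnswer:"
    prediction = llm(prompt).strip().lower()
    valid = {"span", "multi-span", "count", "arithmetic"}
    # Fall back to a default class if the output is not a valid label.
    return prediction if prediction in valid else "arithmetic"
```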
There is still room to improve the performance of the constraint modules, as shown by the low precision for the "hybrid" modality in the confusion matrices in Figure 5. As a result, we decided to treat the "hybrid" modality as an "uncertain" modality attribute; hence, the diversity constraints in the "hybrid" setting promote the usage of all three modalities.

Analysis
How sensitive is SEER to variations in constraint parameters α and β?
We analyze SEER's and SEER gold's sensitivity to different (α, β) value pairs. Increasing α encourages the selection of exemplars that share the predicted attributes of the test instance. Conversely, increasing β promotes the inclusion of exemplars with different attributes, thus ensuring diversity and mitigating errors introduced by the constraint modules. We evaluate the EA and EM on the dev set for the pairs (50, 25), (75, 0), (75, 25), and (100, 0). The results, depicted in Figure 6, indicate that the difference in EA and EM between the lowest and highest performing configurations is less than 1%. Consequently, we conclude that SEER does not require extensive tuning of these two parameters to achieve satisfactory performance. SEER gold performs best when α is set to higher values and β is set to 0. As SEER gold leverages the ground truth attributes, there is no need to mitigate attribute errors with β. This finding aligns with our intuition that guiding exemplar selection based on problem attributes improves performance.

How does SEER perform under different token capacity budgets?
To evaluate the benefits of the double capacity constraint, we evaluate SEER under different token capacity budgets. While the default token capacity of CODEX is 4096, multiple reasons might lead to a reduction of that value, including the use of another LLM or financial and time constraints.
In the following, we consider capacities of 2048 and 1024 tokens. The 1024-token budget is too restrictive to even fit most TAT-QA test instances; hence, we only consider FinQA for that setting. Figure 7 reports the performance of SEER under the 2048-token budget.

Several errors occur even when relevant exemplars are provided to CODEX. This is particularly true for Counting problems, where 4 or more relevant exemplars are provided, but CODEX fails to make the correct count for the test instance. 25% of the FinQA value errors result from a poor computation of the similarity with the candidate exemplars, based on semantically superficial elements. Furthermore, we observed that 17% of the FinQA operator errors are due to one missing reasoning step, e.g., the addition of a constant term. This analysis highlights directions for future work: (1) SEER could benefit from fine-tuning methods to compute the similarity weights (Rubin et al., 2022); (2) SEER could be augmented with mechanisms that predict and control the required number of reasoning steps.

Conclusion
This paper investigates the problem of exemplar selection for ICL in HybridQA tasks. We propose ILP as a framework for exemplar selection and introduce SEER, a novel method based on Knapsack programs. While existing methods only address the high diversity of questions and of the approaches required to solve them, SEER explicitly tackles two further key challenges of HybridQA exemplar selection, namely the low correlation between a problem's question and its attributes, and the large contexts required to solve these problems, in the form of integer linear constraints. Diversity constraints enhance performance by selecting exemplars that share the same attributes as the test instance. Capacity constraints give the end-user fine-grained control over the token budget allocated to the prompt, ensuring sufficient tokens to generate the desired output; this level of control is beneficial for overcoming the limitations of the backbone LLM and dealing with financial constraints. Evaluation on two financial HybridQA datasets shows that SEER outperforms previous ICL exemplar selection methods, especially under limited token capacity.
For future research, we plan to explore additional attributes and seek ways to enhance the overall performance of the constraint modules.Leveraging the flexibility of the Knapsack framework, we intend to study the potential of incorporating constraints defined by end users, such as domain experts, to further refine exemplar selection.Additionally, SEER can be extended beyond the HybridQA setting to encompass other modalities like images and knowledge graphs, as well as other tasks, such as hybrid fact-checking (Aly et al., 2021), data-to-text generation (Parikh et al., 2020), and multimodal fraud detection (Wang et al., 2023).

Limitations
We identify two main limitations of the method introduced in this paper. First, we assumed that we could select ICL exemplars from thousands of training instances with annotated textual reasoning. However, in real-world scenarios, obtaining these textual annotations is a time-consuming process that involves manual labeling; see also the recent study by Su et al. (2023).

A The SEER algorithm

SEER is described in pseudo-code in Algorithm 1.

Algorithm 1: Selection of ExEmplars for hybrid Reasoning (SEER)
  Input: test instance X_test, training set S_train, candidate pool size k
  Output: exemplar selection E_selection
  E_candidates ← KNN(X_test, S_train, k)           ▷ KNN with cosine similarity
  attributes ← {}
  for module ∈ constraint_modules do
      attributes[module.name] ← module.predict(X_test)
  end for
  knapsack ← get_ilp(X_test, attributes)           ▷ generate the Knapsack
  E_selection ← solve(knapsack, E_candidates)      ▷ solve the Knapsack program
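A Python rendering of Algorithm 1 (a sketch: `knn`, the constraint modules, `get_ilp`, and `solve` stand for the components described in the SEER section):

```python
def seer(x_test, s_train, k, constraint_modules, knn, get_ilp, solve):
    # 1. Nearest neighbor filtering by cosine similarity.
    e_candidates = knn(x_test, s_train, k)
    # 2. Constraint modules predict the attributes of the test instance.
    attributes = {m.name: m.predict(x_test) for m in constraint_modules}
    # 3. Generate the instance-specific Knapsack program and solve it.
    knapsack = get_ilp(x_test, attributes)
    return solve(knapsack, e_candidates)
```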

B Diversity constraint templates
We present the potential configurations of the diversity constraints, which are determined by the predictions of the constraint modules. There is one configuration per possible attribute value.

1) Predicted modality: table
subject to $\sum_{i \in S} \text{table}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{text}_i + \text{hybrid}_i$.

2) Predicted modality: text
subject to $\sum_{i \in S} \text{text}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{table}_i + \text{hybrid}_i$.

3) Predicted modality: hybrid
as the "hybrid" prediction is treated as an uncertain modality attribute (see the constraint modules results), the diversity constraints in this setting promote the usage of all three modalities.

4) Predicted answer type: span
subject to $\sum_{i \in S} \text{span}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{mspan}_i + \text{arith}_i + \text{count}_i$.

5) Predicted answer type: multi-span
subject to $\sum_{i \in S} \text{mspan}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{span}_i + \text{arith}_i + \text{count}_i$.

6) Predicted answer type: arithmetic
subject to $\sum_{i \in S} \text{arith}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{span}_i + \text{mspan}_i + \text{count}_i$.

7) Predicted answer type: count
subject to $\sum_{i \in S} \text{count}_i\, x_i \geq \alpha M$ and $\sum_{i \in S} \text{other}_i\, x_i \geq \beta M$, where $\text{other}_i = \text{span}_i + \text{mspan}_i + \text{arith}_i$.

D Implementation details
Table 4 presents an overview of the parameters employed in the components of the SEER framework.

E Complement on evaluation metrics
The ground truth answers of FinQA and TAT-QA follow specific syntactic rules that would require access to many training examples to be learned. These rules are not related to the semantic correctness of the answer. ICL approaches are limited to a few exemplars and hence have a limited ability to learn those syntactic rules. As a result, we provide some flexibility in the evaluation scripts. The following rules are applied: equivalence of answers written as percentage or decimal, removal of characters ($, "million", "billion", ...), equivalence up to a rounding of 2 decimals, and removal of trailing 0s after the comma. Examples are shown in Table 5.

F Comparison with Fine-Tuned models

Tables 6 and 7 list the best reported performance of models that follow a fine-tuning strategy. SEER outperforms several models but lags behind the current SOTA. There are several trade-offs to consider between fine-tuning and ICL approaches. Fine-tuning requires updating the model weights, which costs time and resources. On the other hand, ICL models can be adjusted to

Figure 1: An instance of the FinQA dataset (Chen et al., 2021c). Text snippets of the text paragraphs are shown in the dashed box.

Figure 2: Four problem instances from TAT-QA. Similar questions do not always share similar problem attributes.

Figure 3: Overview of SEER's architecture to select the optimal exemplar set for a HybridQA problem.

Figure 5: Confusion matrices of the best constraint modules on the test sets.

Figure 9: A TAT-QA prompt example, with the exemplar set composed of the k nearest neighbors of the test instance.

Table 1: Main results (%). EA indicates Execution Accuracy, PA Program Accuracy, EM Exact Match, and F1 the numeracy-focused F1 score. Reported results are averages over three iterations on the test set. CODEX is the backbone LLM for all ICL methods.

Table 2: Accuracy (%) of the constraint modules on the dev and test sets.

Table 3: Percentage of errors per category.

Figure 7: Performance of SEER under a 2048-token capacity budget, as evaluated on the dev set. The coverage is the percentage of exemplar sets respecting the budget. The score is the EA (%) for FinQA and the EM (%) for TAT-QA.
C TAT-QA Prompt examples

Figure 9 illustrates an instance from the TAT-QA dataset. The answer type is multi-span and the modality is text.