Retrieval-Augmented Chain-of-Thought in Semi-structured Domains

Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.


Introduction
Building NLP systems for answering questions in the legal and financial domains could save time and resources, ensure compliance, and enhance the overall accuracy and effectiveness of legal and financial operations (Nay et al., 2023; Yang et al., 2023). Applying QA systems to such domains poses unique challenges. These domains feature complex jargon, nuanced phrasing, and contextual dependencies that require specialized knowledge and expertise (Katz et al., 2023; Wu et al., 2023a). A system tailored to these domains should be able to efficiently process and analyze large volumes of legal, financial, or regulatory documents, extracting relevant insights and answering targeted queries.
Large language models (LLMs) have shown impressive performance on several NLP tasks (Zhao et al., 2023). de Padua et al. (2023) show that LLMs trained on large amounts of data are able to obtain the necessary domain knowledge through in-context learning (ICL) (Brown et al., 2020). However, a major limitation of LLMs is the limit on the input size. There have been many attempts to address this limitation (Press et al., 2022; Haviv et al., 2022; Zhu et al., 2023), and multiple transformer models are able to handle longer contexts (OpenAI, 2023; Rozière et al., 2023; Dai et al., 2019; Sun et al., 2023b). However, Liu et al. (2023) show that model performance on certain parts of the input decreases with input size. Further, the cost and latency of LLMs increase with the input size.
The context required for legal and financial questions is often large and may not fit within the token limit, calling for more efficient retrieval. Financial and legal documents are often semi-structured. For example, Figure 1 shows a section from the US Internal Revenue Code. The text is organized into subsections, paragraphs, and bullet points, which we leverage for better information retrieval. Further, financial reports often contain quantities in tabular format. We exploit these structures in a prompting approach that incorporates retrieval to work around the context token limit.
We evaluate the proposed method on two datasets: FinQA (Chen et al., 2021) and SARA (Holzenberger et al., 2020). These datasets feature complex questions that require multiple steps of reasoning and arithmetic computation, which is challenging for language systems. We adopt chain-of-thought (CoT) prompting (Wei et al., 2023) for generating the answers, since it is well suited for performing reasoning in a step-by-step manner. A chain of thought is a coherent sequence of reasoning steps that leads to the correct answer step by step. Providing examples of question-answer pairs along with their CoTs, prepended to the test question, causes GPT-3 to likewise output a CoT along with the answer for the test question, improving its overall reasoning accuracy. CoT prompting is especially useful for complex tasks that require multiple steps of reasoning over the given input. The results demonstrate that this simple and efficient approach outperforms state-of-the-art models in these domains. Training LLMs on financial and legal data may not be feasible, as the data may contain sensitive information. The use of ICL circumvents this problem and avoids the expensive and tedious process of data collection and training. This makes the proposed approach a practical solution in scenarios where labeled data is limited or expensive to obtain. Additionally, CoT prompting offers the advantage of generating explanations and facilitating interpretability in critical domains, where the lack of interpretability is a key obstacle to the adoption of AI systems (Danilevsky et al., 2020). However, a major drawback of the approach is that it is task-specific. In particular, the retrieval relies on the structure within the data and needs to be adapted to data from different sources1.
We hope our work fosters research on coupling LLMs with retrieval in domains such as finance and law, where the ability to extract insights and answer questions about vast amounts of domain-specific data has many practical applications.

Related work
Previous work has proposed training specialized LLMs for the financial and legal domains (Wu et al., 2023a; Huang et al., 2023; Nguyen, 2023; Yang et al., 2023). However, doing so requires a large amount of data and compute, at significant cost. Sun et al. (2023a) evaluate GPT-2 on FinQA (Chen et al., 2021). Blair-Stanek et al. (2023) evaluate GPT-3 with different prompting techniques on SARA, where the context includes all the sections from the statutes. Since the input size of GPT-3 is limited, the prompts only included a subset of sections, which may not contain the required information. Further, fewer in-context examples were used for CoT as compared to few-shot learning. Li et al. (2023) and Wu et al. (2023b) observe better performance with more in-context examples. Nay et al. (2023) test various GPT models with ICL to answer multiple-choice questions over tax laws. A retriever-augmented setting is tested where a dense passage retriever, GTR (Ni et al., 2021), retrieves the top 4 sections relevant to the question. Since entire sections are passed to the LLMs, the text has to be truncated.
This study extends past work by complementing LLMs with a retriever that extracts the relevant text from within the statutes, allowing for larger contexts and more in-context examples in the prompt.

Data
We use two datasets containing questions that involve multi-step logical and arithmetic reasoning from the legal and financial domains respectively.

SARA
The StAtutory Reasoning Assessment (SARA) dataset (Holzenberger et al., 2020) is designed to evaluate statutory reasoning over a set of sections extracted from the US Internal Revenue Code (IRC). For each of the subsections contained in the selected sections, there are two hand-written case scenarios. Correctly solving these cases requires multiple steps of arithmetic as well as logical reasoning. For instance, some cases require computing the amount of tax owed according to a given section, but only if the section applies to the given case. Thus, the dataset serves as a challenging task for an AI system, requiring domain expertise and reasoning abilities.

FinQA
FinQA (Chen et al., 2021) is a financial QA dataset. It comprises 8,281 examples, where each question is accompanied by a financial report containing text as well as a table. The report contains the information necessary to answer the question correctly. FinQA poses many challenges for a QA system. The questions require retrieval, arithmetic, and logical reasoning simultaneously over tables and text. The questions also require an understanding of financial jargon. Finally, multiple reasoning steps are required to derive the answer.

Methodology
Figure 2 shows an overview of our proposed approach, which consists of two main components: retrieval and answering. The retrieval step involves filtering paragraphs from text and rows from tables that are relevant to the question. The retrieved information is then passed to the answering model.

Retrieval
Retrieval is essential for fully leveraging the ICL and CoT reasoning abilities of LLMs. It can help to prevent the required context from exceeding the token limit, while also allowing the prompt to include enough in-context examples along with their CoT explanations. It can also help to reduce the time and cost of inference. We propose to leverage the structure present in the data to retrieve the relevant context from the legal statutes and financial reports. This structure is specific to the data source, and the retriever needs to be designed accordingly. In our analysis, we explore datasets from two different sources: SARA, where a template-based algorithm can be used for effective retrieval; and FinQA, where a more sophisticated pre-trained retrieval model is required.

SARA
As shown in Figure 1, the statutes in SARA are organized in a hierarchical structure with sections, sub-sections, paragraphs, and bullets. This hierarchical structure offers valuable information for efficient and accurate retrieval. The questions contain references to the specific sub-sections they pertain to. First, a simple regular expression-based extractor scans the question text to identify the relevant section name. Next, a rule-based statute parser extracts the mentioned sub-section. The parser reads each sentence in the given statutes and assigns it to the most specific sub-section to which the sentence belongs. Figure 1 shows an example of a parsed statute section. We explore three retrieval strategies:

1. mentioned-only: The retriever returns all the sentences assigned to sub-sections whose labels form a prefix of the queried sub-section. For Figure 1, a query for sub-section 7703(a)(1) will result in sentences assigned to s7703, s7703(a) and s7703(a)(1).

2. entire-section: The retriever returns the entire enclosing sub-section. In Figure 1, a query for sub-section 7703(a)(1) will result in sentences assigned to s7703, s7703(a), s7703(a)(1), as well as s7703(a)(2).

3. references: The retriever returns the sub-sections mentioned in the question along with those that are referenced in these retrieved sub-sections2.
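The parser output can be thought of as a mapping from sub-section labels to the sentences assigned to them, over which the first two strategies reduce to simple prefix tests. The sketch below is illustrative only: the label format, sample sentences, and function names are our assumptions, not the actual parser output or implementation.

```python
import re

# Illustrative parsed statute: each sentence is assigned to the most
# specific sub-section label it belongs to.
PARSED = {
    "s7703": ["Determination of marital status ..."],
    "s7703(a)": ["(a) General rule."],
    "s7703(a)(1)": ["the determination of whether an individual is married ..."],
    "s7703(a)(2)": ["an individual legally separated shall not be considered married."],
}

def retrieve_mentioned_only(query: str, parsed: dict) -> list:
    """Return sentences assigned to the queried sub-section and its
    ancestors, i.e. labels that form a prefix of the query."""
    return [s for label, sents in parsed.items()
            if query.startswith(label)
            for s in sents]

def retrieve_entire_section(query: str, parsed: dict) -> list:
    """Return the entire enclosing sub-section: the query's ancestors
    plus every descendant of its immediate parent."""
    parent = re.sub(r"\([^()]*\)$", "", query)  # drop the last component
    return [s for label, sents in parsed.items()
            if query.startswith(label) or label.startswith(parent)
            for s in sents]
```

On the toy statute above, querying s7703(a)(1) with mentioned-only excludes the s7703(a)(2) sentence, whereas entire-section includes it, mirroring the example in the list.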

FinQA
The absence of a hierarchical structure in FinQA reports makes it impractical to adopt a rule-based approach for retrieval. Chen et al. (2021) convert the tables into text and then use BERT to retrieve relevant sentences from the report.
However, using templates to convert tables into text leads to very long contexts. These templates can also introduce grammatical and logical errors, degrading the performance of the answering module. Thus, we use a tabular format during the answering step in order to exploit the table structure (see Figure 5 in the Appendix).
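To illustrate why templates inflate the context, the sketch below contrasts a sentence-per-cell template serialization with a compact tabular layout on a toy table. The table values, template wording, and helper names are hypothetical; the actual format used in our prompts is shown in Figure 5.

```python
# A toy financial table: header row followed by data rows.
TABLE = [
    ["", "2018", "2019"],
    ["net revenue", "1000", "1200"],
    ["operating cost", "400", "450"],
]

def to_template_sentences(table):
    """Template-based serialization (one sentence per cell), in the
    style of converting tables to text -- verbose and error-prone."""
    header, rows = table[0], table[1:]
    return [f"the {row[0]} in {year} is {val} ."
            for row in rows
            for year, val in zip(header[1:], row[1:])]

def to_tabular(table):
    """Compact pipe-separated layout that preserves the row/column
    structure of the original table."""
    return "\n".join(" | ".join(row) for row in table)
```

Even on this tiny table, the tabular rendering is shorter than the concatenated template sentences, and the gap grows with table size.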

Answering
In this study, we test GPT-3 (text-davinci-003) (Brown et al., 2020) and LLaMA-2 (Touvron et al., 2023) to answer the queries. We experiment with different prompting techniques, namely zero-shot, few-shot and chain-of-thought prompting. CoT prompting has been shown to improve the ICL abilities of sufficiently large LLMs (Zhao et al., 2023) and is especially useful for tasks that require multiple steps of reasoning.
In the zero-shot setting, the model is given the retrieved context and the question and is expected to output just the answer without any explanation.
In the few-shot setting, we further include in-context examples of question-answer pairs (8 examples for SARA and 12 examples for FinQA3).
In the CoT setting, we use the same in-context examples as in the few-shot setting, but each example also includes a CoT explanation. These explanations are manually written for each example. The model is expected to generate the answer along with a CoT explanation for the test cases.
For all questions in a dataset, we use the same prompt containing the same in-context examples which are selected using prompt tuning as described in Appendix section A.1.
Figures 4 and 5 in the Appendix show the CoT prompts used for SARA and FinQA respectively.
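The three prompt variants differ only in how each in-context example is rendered and whether any examples are included at all. The following is a minimal sketch of the prompt assembly; the field names and wording are illustrative, not the exact prompts shown in Figures 4 and 5.

```python
def render_example(ex: dict, mode: str) -> str:
    """Render one in-context example for few-shot or CoT prompting."""
    lines = [f"Context: {ex['context']}", f"Question: {ex['question']}"]
    if mode == "cot":
        # The hand-written step-by-step explanation precedes the answer.
        lines.append(f"Explanation: {ex['cot']}")
    lines.append(f"Answer: {ex['answer']}")
    return "\n".join(lines)

def build_prompt(examples, test_context, test_question, mode="cot"):
    """Assemble a zero-shot, few-shot, or CoT prompt: the same fixed
    in-context examples, followed by the test case with the answer left
    for the model to fill in."""
    parts = []
    if mode in ("few-shot", "cot"):
        parts += [render_example(ex, mode) for ex in examples]
    parts.append(f"Context: {test_context}\nQuestion: {test_question}\nAnswer:")
    return "\n\n".join(parts)
```

Because the examples and explanations are fixed across the dataset, the retrieved context for the test question is the only part of the prompt that varies.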

Evaluation
For SARA, the task is formulated as an entailment task and is evaluated as a binary classification task.
For FinQA, Chen et al. (2021) propose program accuracy, where the model is expected to generate a 'program' along with the answer. A program is a sequence of mathematical operations that leads to the final answer. The evaluation thus compares the generated program against the gold program.
We also measure the answer accuracy, ignoring errors in units, prefixes, suffixes, precision digits, or rounding.
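A minimal sketch of this relaxed answer-accuracy check follows; the normalization rules are our reading of "ignoring units, prefixes, suffixes, and rounding errors", not the official FinQA scorer.

```python
import math
import re

def normalize(ans: str):
    """Strip commas and surrounding units/symbols and parse the first
    number in the string; return None if no number is present."""
    m = re.search(r"-?\d+\.?\d*", ans.replace(",", ""))
    return float(m.group()) if m else None

def answer_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Count a prediction as correct if its numeric value agrees with
    the gold answer up to a small relative tolerance (rounding),
    ignoring surrounding units, prefixes, and suffixes."""
    p, g = normalize(pred), normalize(gold)
    if p is None or g is None:
        # Non-numeric answers fall back to exact string comparison.
        return pred.strip().lower() == gold.strip().lower()
    return math.isclose(p, g, rel_tol=rel_tol)
```

Under this check, "$24,543" matches "24543.0" and "14.1%" matches "14.10", while genuinely different values are still rejected.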

Comparison with existing methods
Tables 1 and 2 show results on SARA and FinQA respectively4. Descriptions of the baselines are provided in Appendix Section B.
On SARA, both GPT-3 and LLaMA2-70B surpass the existing methods by a significant margin. We also observe the expected trend of performance improving with model size, with GPT-3 (175B) performing significantly better than the LLaMA-2 models (Kaplan et al., 2020)5.
On the other hand, the performance on FinQA with GPT-3 is comparable to the baselines in terms of program accuracy but lags behind in answer accuracy. We believe this behavior is due to arithmetic errors made by LLMs (Qian et al., 2023), resulting in cases with correct programs but incorrect answers. Our approach with LLaMA2-13B/70B and GPT-3 outperforms general crowd workers, who lack domain expertise in finance, whereas it falls short of financial experts.
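Since the generated program is often correct even when the final number is not, such arithmetic slips could be sidestepped by executing the operation sequence deterministically. The sketch below handles FinQA-style programs such as subtract(5829, 5735), divide(#0, 5735), where #i back-references the result of an earlier step; the parsing is a simplified assumption, not the official executor.

```python
import re

# Basic arithmetic operations of FinQA-style programs.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def execute_program(program: str) -> float:
    """Deterministically execute a comma-separated sequence of
    operations, where '#i' refers to the result of the i-th step."""
    results = []
    for op, raw_args in re.findall(r"(\w+)\(([^)]*)\)", program):
        args = [results[int(a.strip()[1:])] if a.strip().startswith("#")
                else float(a.strip())
                for a in raw_args.split(",")]
        results.append(OPS[op](*args))
    return results[-1]
```

With such an executor, a model that emits the right program never loses answer accuracy to a multiplication or division mistake.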
The bottom section of Table 2 highlights the effectiveness of GPT-3 over FinQANet (Chen et al., 2021) when provided with the gold retrieval results. However, LLaMA-2 shows sub-par performance.

Ablation studies
Comparison of prompting techniques: Table 4 in the Appendix shows the evaluation results for zero-shot, few-shot and CoT prompting. CoT prompting leads to significantly better results across all models.
Comparing retrieval strategies: As outlined in Section 4, we test three different retrieval strategies for SARA. Table 3 reveals that mentioned-only and references perform significantly better than entire-section. The questions in SARA are designed such that additional context beyond the mentioned sub-sections is not required. The difference in accuracy indicates the benefit of more targeted retrieval, since over-retrieval may dilute the signal provided by the directly relevant context.

Case analysis: We perform a manual qualitative inspection of the generated CoT explanations and report the analysis in Appendix Section D.2.

Discussion
This study aims to utilize LLMs for challenging domain-specific QA tasks by using ICL along with retrieval techniques that leverage the semi-structured nature of financial and legal data. The proposed approach is simple and performs well compared to existing systems. It exploits ICL, which avoids the costly and time-consuming processes of data collection and training. Since the proposed system produces a chain of thought with each output, it is easily interpretable, and errors can be identified and rectified by human supervision (Danilevsky et al., 2020).
We hope this work will encourage researchers to delve deeper into the analysis and development of LLM-integrated NLP systems and retrieval-augmented LLMs.

Limitations
The retrieval algorithms in our study are specifically tailored to each dataset. Despite good reasoning abilities, the evaluation reveals that arithmetic errors are common. Further, inference with LLMs can be costly, with latency higher than traditional approaches, making them sub-optimal for handling large volumes of data efficiently.
These limitations point to interesting future directions, such as using arithmetic tools as plugins (Schick et al., 2023) for better performance and developing more generalizable retrieval algorithms. Further, several domain-specific LLMs can be tested (Huang et al., 2023; Yang et al., 2023; Wu et al., 2023a).

A.1 Prompt tuning
We iteratively refine the prompt using the validation sets of 40 samples for each dataset, with the aim of finding a prompt that encompasses a diverse range of cases while avoiding an overabundance of trivial or similar examples.

B Baselines
On SARA, we evaluate our system against the following baselines:

• Majority baseline: A trivial baseline that predicts the majority class for all the samples.
• Legal-BERT: A BERT model trained specifically on the legal domain (Chalkidis et al., 2020) and adapted for SARA by Holzenberger et al. (2020).
On FinQA, we compare our system with the following baselines:

• APOLLO (Sun et al., 2023a): The retriever is based on sequence-pair classification following Nogueira and Cho (2020). The program generator combines a BERT encoder and an LSTM decoder with an attention mechanism, along with consistency-based reinforcement learning.
Figure 3: A case in SARA for section 7703.

                 Correct CoT    Incorrect CoT
Correct ans          23               8
Incorrect ans         3               6

Table 5: Results of the manual analysis performed on the validation set using GPT-3.

Table 4 shows the performance of different models with different prompting techniques.

D.2 Case analysis
SARA: We conducted a manual analysis of the model's output on the validation set. Table 5 presents the results of this analysis on SARA, indicating the number of examples where both the answer and the chain-of-thought reasoning provided by the model were correct, where both were incorrect, and where one of them was incorrect. We found that in 58.5% of the examples, the model accurately predicted both the output and the reasoning. For the remaining cases, we categorized the errors into four distinct categories, shown in Table 6.
FinQA: On the constructed validation set comprising 40 samples, we observe that 30 samples have correct answers as well as correct corresponding programs.
For the remaining 10 samples, we manually classify the errors into different categories, as shown in Table 7.

Figure 1 :
Figure 1: An example of a statute from the US Internal Revenue Code (left) and the subsection name assigned to each sentence after parsing, as described in Section 4.1.1 (right).

Figure 2 :
Figure 2: An overview of the proposed system on a sample input from SARA. The retriever extracts the relevant information from the context and combines it with the question. In-context examples are appended to the retrieval output to construct a prompt, which is used to query LLMs to generate an answer along with the chain of thought.

Figure 3 in the Appendix shows an example of a question from SARA.

Figure 3 shows a question from SARA.

D Ablation studies

D.1 Zero-shot, few-shot and CoT prompting

Figures 4 and 5 show the prompts used for SARA and FinQA respectively.

Figure 4 :
Figure 4: Chain-of-thought prompt for SARA for a sample. The complete prompt contains 8 in-context examples with CoT explanations, followed by the question that the model is supposed to answer. The in-context examples and explanations remain the same for all questions in the dataset. The text highlighted in yellow shows the CoT explanations that we hand-crafted, while the test question is shown in blue.

Figure 5 :
Figure 5: Chain-of-thought prompt for FinQA for a sample. The complete prompt contains 12 in-context examples with CoT explanations, followed by the question that the model is supposed to answer. The in-context examples and explanations remain the same for all questions in the dataset. The text highlighted in yellow shows the CoT explanations that we hand-crafted, while the test question is shown in blue.

Table 1 :
Comparison of the proposed system's performance on SARA with the existing baselines. The top section shows non-LLM-based methods. The middle section shows the evaluation results from Blair-Stanek et al. (2023). The bottom section shows the results of our proposed system, with 'Ret' representing the proposed retrieval. Results are shown with the 90% confidence interval.

Table 2 :
Comparison with state-of-the-art and baseline methods on FinQA. Results are presented with a 90% confidence interval.

Table 3 :
Comparison of the three retrieval strategies used with GPT-3 on the SARA validation set.

Table 6 :
Error analysis on the validation set of SARA.

Table 7 :
Error analysis of the incorrect examples on 40 samples from FinQA.