FinQA: A Dataset of Numerical Reasoning over Financial Data

The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks in the general domain, the finance domain requires complex numerical reasoning and an understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with question-answer pairs over financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments on our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset, the first of its kind, should therefore enable significant new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA.


Introduction
Financial analysis is a critical means of assessing business performance, and the consequences of poor analysis can run to billions of dollars (Jerven, 2013; MacKenzie, 2008). To facilitate high-quality, timely decision making, professionals, such as analysts or investors, perform complex quantitative analysis to select information from financial reports. Such analysis demands advanced expertise in reasoning over heterogeneous (structured and unstructured) data sources and in performing complex numerical reasoning, for example, comparing financial ratios of profitability or growth. These challenges are compounded by an exponentially expanding collection of company financial documents (MacKenzie et al., 2012; Lange et al., 2016), such that it is genuinely unclear whether dedicated human effort can produce fiscal analysis of sufficient quality for current decision making. This poses an interesting question: can we automate such deep analysis of financial data?
A few NLP studies in Question Answering (QA) explored the numerical reasoning capabilities needed to answer questions correctly. For example, the DROP dataset (Dua et al., 2019) focused on Wikipedia-based questions that require numerical reasoning, e.g., "Where did Charles travel to first, Castile or Barcelona?" needs a comparison between the times of two events. However, most prior work only targeted the general domain, where the questions involve much less calculation (mostly one-step calculations) than those of the financial domain. Financial QA is more challenging than classic QA (Rajpurkar et al., 2018) because it requires the system to spot relevant information across heterogeneous sources, such as tables and unstructured texts, and then create a numerical reasoning path to connect all the information. It also takes substantial knowledge to ask meaningful financial questions. It is not clear how well large language models, which have performed well on general-domain QA, can be adapted to answer realistic, complex financial questions. This paper introduces FINQA, an expert-annotated dataset that contains 8,281 financial QA pairs, along with their numerical reasoning processes. Eleven finance professionals collectively constructed FINQA based on the earnings reports of S&P 500 companies (Zheng et al., 2021). The reasoning processes for answering these questions are made up of many common calculations in financial analysis, such as addition, comparison, and table aggregation. To the best of our knowledge, FINQA is the first dataset of its kind to tackle complicated QA tasks based on real-world financial documents.
We propose a retriever-generator QA framework to first retrieve supporting facts from financial reports, then to generate executable reasoning programs to answer the questions. Equipped with pretrained language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), our proposed approach outperforms all other baselines and achieves an execution accuracy of 65.05%. Although our system outperforms the non-expert crowd (50.68%), the significant accuracy gap between the model and human experts (91.16%) motivates the need for future research.
The main contribution of this work is three-fold: • We propose the task of QA over financial data to assist financial analysis. The task emphasizes an important phenomenon for the NLP community to study and analyze how the current pre-trained models perform on complex and specialized domains.
• We construct a new large-scale dataset, FINQA, with 8,281 examples written by financial experts, with fully annotated numerical reasoning programs.
• We experiment with various baselines and find that the models still fall far behind expert performance, strongly motivating future research.

Related Work
Numerical Reasoning QA. There are QA datasets over structured tables, such as WikiTableQuestions (Pasupat and Liang, 2015), Spider (Yu et al., 2018), TabFact (Chen et al., 2020b), etc. For reading comprehension, the dataset most related to ours is the DROP dataset (Dua et al., 2019), which applies simple calculations over texts. The top methods on DROP typically use specific prediction heads for each kind of calculation. HybridQA (Chen et al., 2020c) targets QA over both tables and text, but without a focus on numerical reasoning. All these existing datasets are built upon the general domain (mostly based on Wikipedia). In contrast, our dataset focuses on the finance domain, whose numerical reasoning questions are of a much more complex nature, combining both structured tables and unstructured texts. Another kind of QA dataset related to ours is the math word problem datasets, like MaWPS (Koncel-Kedziorski et al., 2016) and MathQA (Amini et al., 2019), where the task is to generate the solution program for a short input math problem. Existing models include (Kim et al., 2020; Chen et al., 2020a,d), etc.
Financial NLP. Financial NLP has become a major application domain attracting growing attention. Previous work in the finance domain includes risk management to detect fraud (Han et al., 2018; Nourbakhsh and Bang, 2019), sentiment analysis to assist market prediction (Day and Lee, 2016; Wang et al., 2013; Akhtar et al., 2017), and opinionated question answering (Liu et al., 2020), such as the FiQA dataset built from forums and social media. Recent works attempt to develop pre-trained models specialized for the finance domain (Araci, 2019), where the downstream tasks are mostly sentiment classification. To the best of our knowledge, there is no previous work and dataset on building QA systems for numerical reasoning over financial reports.

Task Definition
Problem Formulation. Presented with a financial report consisting of textual contents E and a structured table T, and given a question Q, the task is to generate the reasoning program G = {w_0, w_1, ..., w_n}, where each w_i is a program token defined by a domain-specific language (DSL); the program is then executed to obtain the answer A:

P(A | T, E, Q) = Σ_{G_i} P(G_i | T, E, Q)

where {G_i} is the set of all correct programs that evaluate to the answer. For financial tables, there is typically a description header (blue header in Figure 1), which often gives the timing information, and each row has its name on the left. Some financial tables may have more complicated layouts, e.g., nested structures. As a first step in this direction, in this paper we focus only on regular layouts for simplicity.
Each operation takes a list of arguments args_n. After consulting with financial experts, and since most accounting and financial valuation theory primarily involves linear algebra, we include 10 common types of operations in our dataset. There are 6 mathematical operations: add, subtract, multiply, divide, greater, exp; and 4 table aggregation operations: table-max, table-min, table-sum, and table-average. The table operations take table row names as arguments. We use the special token #n to denote the result of the n-th step. For example, in Figure 1, the program consists of 3 steps: the first and second division steps take arguments from the table and the text, respectively, then the third step subtracts the results of the two previous steps. Refer to Appendix A for more details of the operations and the grammar.
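To make the DSL concrete, here is a minimal sketch (our illustration, not the official FinQA executor) of how such a program, with #n tokens referring to earlier steps, could be executed:

```python
# A minimal sketch of running a FinQA-style program.  Operation names follow
# the paper's DSL; "#n" refers to the result of the n-th step.

def execute_program(steps):
    ops = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
        "greater": lambda a, b: a > b,
        "exp": lambda a, b: a ** b,
    }
    results = []
    for op, args in steps:
        # Resolve "#n" references to previous results before applying the op.
        vals = [results[int(a[1:])] if isinstance(a, str) and a.startswith("#")
                else a for a in args]
        results.append(ops[op](*vals))
    return results[-1]

# The 3-step pattern from Figure 1: two divisions, then a subtraction.
program = [("divide", [10.0, 100.0]),
           ("divide", [5.0, 50.0]),
           ("subtract", ["#0", "#1"])]
print(execute_program(program))  # 0.0
```

Table aggregation operations would additionally need access to the table rows, which this sketch omits.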
Evaluations. Previous studies on QA with numerical reasoning evaluate only execution accuracy, i.e., the final results of the generated programs, as in DROP (Dua et al., 2019) and MathQA (Amini et al., 2019). However, applications in the finance domain generally impose much stronger requirements on explainability and transparency. Therefore, we also provide the gold programs for our dataset. Besides execution accuracy, we also propose to evaluate the accuracy of the generated programs. Specifically, we replace all the arguments in a program with symbols, and then we evaluate whether two symbolic programs are mathematically equivalent. For example, the following two programs are equivalent: add(a1, a2), add(a3, a4), subtract(#0, #1) and add(a4, a3), add(a1, a2), subtract(#1, #0). Note that execution accuracy tends to overestimate performance, because sometimes the model hits the correct answer by chance, while program accuracy tends to produce false negatives, since some questions may have multiple correct programs.
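As an illustration of symbolic program equivalence, the following sketch uses a simplification we assume here (randomized testing rather than exact symbolic checking): two symbol-level programs are deemed equivalent when they agree on random argument assignments.

```python
import random

def eval_program(steps, env):
    # Evaluate a symbolic program: each arg is a symbol name or a "#n" step reference.
    ops = {"add": lambda x, y: x + y, "subtract": lambda x, y: x - y,
           "multiply": lambda x, y: x * y, "divide": lambda x, y: x / y}
    results = []
    for op, a, b in steps:
        x = results[int(a[1:])] if a.startswith("#") else env[a]
        y = results[int(b[1:])] if b.startswith("#") else env[b]
        results.append(ops[op](x, y))
    return results[-1]

def equivalent(p1, p2, symbols, trials=20):
    # Randomized check: programs that agree on many random inputs are
    # (with high probability) mathematically equivalent.
    for _ in range(trials):
        env = {s: random.uniform(1.0, 10.0) for s in symbols}
        if abs(eval_program(p1, env) - eval_program(p2, env)) > 1e-9:
            return False
    return True

p1 = [("add", "a1", "a2"), ("add", "a3", "a4"), ("subtract", "#0", "#1")]
p2 = [("add", "a4", "a3"), ("add", "a1", "a2"), ("subtract", "#1", "#0")]
print(equivalent(p1, p2, ["a1", "a2", "a3", "a4"]))  # True
```

An exact checker would instead compare canonicalized symbolic expressions; the randomized version shown here trades certainty for simplicity.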

Data Preparation
Data Source. We develop FINQA based on the publicly available earnings reports of S&P 500 companies from 1999 to 2019, collected in the FinTabNet dataset (Zheng et al., 2021). An earnings report is a set of pages in a PDF file that outlines the financials of a company, which usually contains tables and texts. The FinTabNet dataset has annotated the tables in each report.
Data Filtering. Realistic earnings reports contain many tables not suitable for numerical reasoning tasks. Equipped with the table annotations in FinTabNet, we filter the data as follows. First, we extract the pages in earnings reports with at most one table. Second, we exclude the tables with over 20 rows, over 2 description headers, or other complex nested structures. We also exclude tables with tedious contents, such as catalogs, which are common in FinTabNet. As stated in §3, these over-complicated tables are out of the scope of this work. Finally, for the tables with 2 description headers, we merge them into a single header to simplify the representations. As a result, a total of 12,719 pages were selected for further annotation.
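The filtering criteria above can be sketched as a simple predicate. This is a hypothetical illustration: the thresholds mirror the text, but the dict-based table representation is our assumption, not FinTabNet's schema.

```python
# A hypothetical predicate mirroring the filtering rules in this section.

def keep_page(tables):
    if len(tables) > 1:          # keep pages with at most one table
        return False
    if not tables:               # "at most one" permits text-only pages
        return True
    t = tables[0]
    return (t["n_rows"] <= 20            # drop tables with over 20 rows
            and t["n_headers"] <= 2      # ... or over 2 description headers
            and not t["nested"])         # ... or nested structures

print(keep_page([{"n_rows": 12, "n_headers": 1, "nested": False}]))  # True
```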

Annotation Procedure
Recruiting Expert Annotators. We post job ads on UpWork and hire eleven US-based experts with professional finance backgrounds (CPAs, MBAs, etc.). Each hire is interviewed using four example report pages and asked to compose example Q&A pairs. After hiring, each annotator first goes through a training session to learn the task and the annotation interface (Appendix D). When the workers fully master the annotation process, we launch the official batches for them to work on.
An annotator can compose up to two questions for each given report page or skip if it is hard to compose any meaningful question. We pay around $2.0 for each question, which leads to an average hourly wage of $35.0. The whole data collection took around eight weeks.
We do not use popular micro-task platforms, such as Amazon Mechanical Turk (MTurk), because our preliminary studies show that many MTurk workers cannot perform this task effectively; our experiment with MTurk workers in §4.3 further echoes this observation. While most existing QA datasets were constructed by MTurk workers (Dua et al., 2019; Chen et al., 2020c), our task requires substantial domain-specific knowledge to compose meaningful questions that are hard for computers to answer.
Annotation Task Design. For each page selected in §4.1, the annotators are asked to (i) write a meaningful financial question, (ii) compose a reasoning program to answer the question, and (iii) annotate the supporting facts. Each page is assigned to one or two experts for annotation. We detail each part as follows. (I) Financial question: For a given page of an earnings report, the annotators are asked first to compose a question that is "meaningful for financial analysis or learning insights of the company financial reports" and requires numerical calculations to answer. We encourage the experts to write questions that require information from both the text and the table to answer. (II) Reasoning program: After providing the question, the annotators are asked to elaborate the operation steps to answer it. Specifically, they compose a maximum of 5 operation steps, where each operation has four slots: "operation", "argument1", "argument2", and "result". The "operation" is one of the ten predefined operations described in §3. An "argument" is a number or a table row name from the report, or the result of a previous step. For operations that take only one argument, such as table aggregation, workers can leave argument2 blank. The annotation interface (see Appendix D) automatically validates the inputs to ensure correctness. (III) Supporting facts: We also ask the annotators to mark all the sentences in the text and the table rows that contain the information needed to answer the question.

Data Quality Assessment
External experts answer FINQA questions with high accuracy and high inter-annotator agreement. To validate the quality of the annotations, as well as to establish a human expert performance upper bound, we hire another two financial professionals on UpWork. We randomly sample 200 examples from our dataset and ask the professionals to answer the questions as well as write the operation steps, following the same procedure as in the dataset construction. The payment is $2.0 per question. For execution accuracy, they reach 92.25% and 90.06%, respectively (mean = 91.16%). For program accuracy, they reach 89.44% and 85.53% (mean = 87.49%). The agreement between the two annotators is 92.65% for execution accuracy and 86.76% for program accuracy.
Non-expert crowd workers answer FINQA questions with low accuracy. We also test how well non-expert MTurk workers can answer FINQA questions. We distribute the samples on MTurk and follow a similar process, assigning each example to two workers. We end up with an average execution accuracy of 50.68% and a program accuracy of 48.17%, far below the expert performance; the agreement rate is only around 60%. These results echo our preliminary study's observations for MTurk workers in §4.2.

Statistics of Supporting Facts. 11.07% of the examples have more than two pieces of facts; the rest have one or two. For the examples with more than one piece of fact, we also calculate the maximum distance between all the facts of the same example: 55.48% have a maximum distance of 3 or fewer sentences, 24.35% a distance of 4-6 sentences, and 20.17% a distance of over 6 sentences.
Statistics of Reasoning Programs. In the programs, the most frequent operations, add, subtract, multiply, and divide, have frequencies of 14.98%, 28.20%, 5.82%, and 45.29%, respectively. Division is the most frequent operation, as calculating ratios is common in financial analysis. In FINQA, 59.10% of the programs have 1 step, 32.71% have 2 steps, and the remaining 8.19% have 3 or more steps.

Baseline Systems
In this section, we first describe our main baseline framework FinQANet in §5.1, and then we introduce other baselines in §5.2.
5 For tables, we consider one row as one "sentence".

Figure 2: The retriever retrieves supporting facts (text sentences or table rows) from the input financial report.

The FinQANet Framework
As a preliminary attempt on FINQA, we propose FinQANet, with a retriever to first retrieve the supporting facts from the input financial report, then a generator to generate the program to get the answer.
Retriever The full page of a financial report can exceed 2,000 tokens, which current popular QA models cannot cope with (Devlin et al., 2019). Therefore, we first retrieve the supporting facts from the input report. For the tables, we use templates to turn each row into a sentence. For example, the last row of the table in Figure 1 is represented as 'the risk-free interest rate of 2006 is 5%; ...'. We concatenate each candidate fact with the question and train a classifier using pre-trained LMs like BERT (Devlin et al., 2019). We then take the top n retrieved facts, reordered as they appear in the input report. This set of retrieval results serves as the input to the second phase. Figure 2 illustrates the retrieval procedure. Another common strategy is a sliding window (Alberti et al., 2019): we slide a window of fixed size with a stride through the report, and the windows containing all the supporting facts are marked as positive. However, we observe in the experiments that the length of the input to the program generator in the second phase greatly influences performance, and the sliding-window approach falls far behind the previous method.
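The row-to-sentence templates can be sketched as follows (a hypothetical helper; the wording follows the example given in the text):

```python
# Turn one table row into a sentence, roughly as the templates described above.

def linearize_row(header, row_name, cells):
    parts = [f"the {row_name} of {col} is {val}"
             for col, val in zip(header, cells)]
    return "; ".join(parts) + " ."

print(linearize_row(["2005", "2006"], "risk-free interest rate", ["4.3%", "5%"]))
# the risk-free interest rate of 2005 is 4.3%; the risk-free interest rate of 2006 is 5% .
```

Each linearized row can then be concatenated with the question and scored by the BERT-based classifier like any text sentence.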
Program Generator Given the retrieved supporting facts from the retriever, the program generator aims to generate the executable program that answers the question. Figure 3 gives an overview of the program generator. The generated tokens come from 3 sources: 1) the input passage (retriever output) and the question tokens {e_i}, such as the numbers or the table row names; 2) the special tokens {s_i} from the DSL, such as the function names and pre-defined constants; and 3) the step memory tokens {m_i}, which denote the results from previous steps.

Figure 3: Overview of the program generator, which decodes the program token by token.

An LSTM is used for decoding. At each decoding step T, the program token embeddings H are fed as the input; the decoder output h_T is used to calculate the attention vectors att_p and att_h over the input and the decoding history, and a context vector c_T combines all the contextual information. Meanwhile, another attention vector over the input is applied to all the token embeddings. Different from other program tokens, the step memory tokens {m_i} capture the reasoning path of the program. To make use of this structural information, at each decoding step that ends an operation[args] unit, i.e., the step that generates the closing parenthesis in our DSL, we compute another context vector a_T, and the step memory token embedding corresponding to the current step is updated to a_T. The final prediction is then calculated from these representations. At inference time, based on the grammar of the DSL, we use masks at each decoding step to ensure the structural correctness of the generated programs. In the retriever phase, we take the top n retrieved results as the input to the program generator; therefore, for training the program generator, we use the retriever results on the training set (combined with the gold facts if there is any wrong prediction) as the input.
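The grammar-based decoding masks can be illustrated with a sketch (our assumed rule set over token kinds, not the authors' implementation); note that a #n reference is only valid once at least one step has completed:

```python
# A sketch of grammar-based decoding masks over token *kinds* in the DSL.

def valid_next(prev, n_done_steps):
    """Token kinds allowed after `prev`; `n_done_steps` counts finished operations."""
    if prev in (None, ")"):                 # start of the program or of a new step
        return {"op("}
    if prev == "op(":                       # first argument slot
        allowed = {"number", "row", "const"}
    elif prev in ("number", "row", "const", "#ref"):
        allowed = {"number", "row", "const", ")"}  # next argument or close the step
    else:
        raise ValueError(prev)
    if n_done_steps > 0:                    # "#n" is only valid after step n exists
        allowed.add("#ref")
    return allowed

print(sorted(valid_next("op(", 0)))  # ['const', 'number', 'row']
```

In the actual decoder, such a rule set would be converted to a mask over the vocabulary logits before the softmax at each step.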

Other Baselines
TF-IDF + Single Op. We use TF-IDF to retrieve the top 2 sentences from the input report. Since the most common case in our dataset is a one-step program and the most common operation is division, we take the first number from each retrieved sentence and apply the division operation.
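This baseline can be sketched end to end with a small, self-contained TF-IDF ranker (a toy illustration with hypothetical sentences, not the actual implementation):

```python
import math, re
from collections import Counter

def tfidf_rank(question, sentences):
    # Rank sentences by TF-IDF cosine similarity to the question.
    tok = lambda s: re.findall(r"[a-z0-9%$.]+", s.lower())
    docs = [Counter(tok(s)) for s in sentences]
    n = len(docs)
    idf = {w: math.log(n / (1 + sum(w in d for d in docs))) + 1
           for d in docs for w in d}
    vec = lambda c: {w: f * idf.get(w, 1.0) for w, f in c.items()}
    def cos(u, v):
        num = sum(u[w] * v.get(w, 0.0) for w in u)
        den = (math.sqrt(sum(x * x for x in u.values()))
               * math.sqrt(sum(x * x for x in v.values())))
        return num / den if den else 0.0
    qv = vec(Counter(tok(question)))
    return [s for _, s in sorted(enumerate(sentences),
                                 key=lambda p: -cos(qv, vec(docs[p[0]])))]

# Hypothetical report sentences:
sents = ["net revenue was 100 in 2016 .",
         "total assets were 500 at year-end .",
         "the weather was pleasant ."]
top2 = tfidf_rank("what is the ratio of net revenue to total assets ?", sents)[:2]
# Single-op baseline: first number of each top sentence, then divide.
nums = [float(re.search(r"\d+", s).group()) for s in top2]
print(nums[0], "/", nums[1])
```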
Retriever + Direct Generation. To demonstrate the necessity of generating the reasoning programs, we keep the architecture the same as our model but directly generate the final results.
Retriever + Seq2seq. We use a Seq2seq architecture for the generator, similar to the Seq2seq baseline in the MathQA dataset (Amini et al., 2019). A bi-LSTM is used for encoding the input, and then an LSTM is used for decoding with attention.
Retriever + NeRd. The Neural Symbolic Reader (NeRd) (Chen et al., 2020d) is also a pointer-generator-based model for program generation, with state-of-the-art results on the MathQA dataset (Amini et al., 2019). Different from ours, it directly learns the program in nested format as a sequence, i.e., without the step memory tokens. In this way, the model can learn program structures as patterns from very large-scale data (~40k examples for MathQA), but may fail to learn the reasoning paths. We keep the retriever part the same and compare the generator part to demonstrate the usefulness of structure learning.
Pre-Trained Longformer. There is also work on modeling very long documents with attention mechanisms that scale linearly with sequence length, such as the Longformer (Beltagy et al., 2020). To demonstrate the necessity of splitting the task into the retriever and program generator pipeline, we remove the retriever and directly use the pre-trained Longformer as the input encoder in the program generator, encoding the whole report. The table rows are linearized as in §5.1.

Experimental Results
Experiment Setups. For the retriever, we use BERT-base as the classifier (other pre-trained models perform similarly). Since most of the examples in our dataset have 1 or 2 facts, and we find that longer inputs lower the performance of the program generator, we take the top 3 ranked facts as the retriever results. For the program generator, we experiment with BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and FinBert (Araci, 2019) as the encoder, to test the performance of popular large pre-trained models. For all models, we use the Adam optimizer (Kingma and Ba, 2015). See Appendix B for more details of training and parameter settings. Table 2 presents the results for all the baseline systems. We evaluate the execution accuracy (exe acc) and program accuracy (prog acc) as explained in §3. For the BERT-based retriever, we have 89.66% recall for the top 3 retrieved facts and 93.63% recall for the top 5. Using TF-IDF results in 82.91% recall for the top 5 facts. We use the same retriever results for all retriever-generator based models. Directly generating the execution results gives near-zero scores, which indicates the necessity of generating the reasoning programs. Without the retriever-generator pipeline, directly applying an end-to-end pre-trained Longformer model falls far behind, because longer inputs contain more numbers, which confuse the program generator and make it harder to learn. Generally, the program generators using pre-trained models perform much better than the Seq2seq baseline, as their language modeling knowledge also transfers to the finance domain. Larger pre-trained models give better performance, as they tend to have seen more financial text during pre-training. FinBert (Araci, 2019) is a pre-trained model for the finance domain; its main downstream tasks are sentiment analysis.
The performance of using FinBert is no better than BERT-large, mostly because its pre-training corpus is limited (~30M words from news articles).

QA Model Performance
Comparing FinQANet with the retriever + NeRd baseline (Chen et al., 2020d) shows the improvement gained from learning the logical structure of the programs. We also run the program generator on the gold retriever results, shown as FinQANet-Gold. Another interesting observation is the comparison with human performance: while there is still a large gap to the human expert upper bound, the best-performing model already surpasses the general crowd performance.

Performance Breakdown
We conduct a set of performance breakdowns using the FinQANet (RoBERTa-large) model. Table 3 shows all the results.
Necessity of using both table and text. We run inference taking facts from only a single source from the retriever; performance drops in both single-source settings (Table 3), confirming that both the table and the text are necessary.

Figure 4: Error cases, e.g., "what is the amount of credit lines that has been drawn in millions as of year-end 2016?", whose gold facts describe the company's committed and uncommitted credit lines. In these examples, the retriever results all correctly cover the gold facts; thus we only present the gold facts, gold program, and predicted program to study the errors of the program generator. We give more error cases in Appendix C, including cases of retriever errors.

Questions that need more than two steps to answer are challenging. The model has a low accuracy (22.78%) on the questions that need three or more steps. Meanwhile, not surprisingly, the questions that require only one step are the easiest.
Constants in programs. Many programs in FINQA contain constants as arguments. A constant is often used to convert one English scale word into another. For example, we first need the constant 1,000 to convert "1.5 billion" into "1,500 million" so that it can be added to "50 million". A constant is also used to make explicit the numbers hidden in the language. For example, to calculate "the average for the years 2012, 2013, and 2014", the program needs the constant 3 as the denominator, which is not mentioned explicitly in the text. As shown in Table 3, the programs with constants pose great challenges for our model, as the performance (43.88%) is much lower than on the whole set (61.24%).
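The two uses of constants can be illustrated with the arithmetic they stand for (the 2012-2014 yearly figures below are made up for illustration):

```python
# Constant 1000 converts between scale words: "1.5 billion" + "50 million", in millions.
total_millions = 1.5 * 1000 + 50
# Constant 3 is the implicit denominator for "the average of 2012, 2013, and 2014".
average = (10.0 + 12.0 + 14.0) / 3
print(total_millions, average)  # 1550.0 12.0
```

In neither case does the constant (1000 or 3) appear verbatim in the report text, so the generator must infer it from the language.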

Error Analysis
We sample 50 error cases from the results of the FinQANet (RoBERTa-large) model and analyze them manually. 15% of the errors are caused by the retriever, e.g., missing facts. Of the remainder, half are due to a lack of financial knowledge, such as the meaning of certain terminology, and the other half are primarily numerical reasoning errors, including complex programs with multiple steps, numerical unit conversions, and resolving the ordering and matching of numbers and years. Many error cases involve both numerical reasoning problems and misunderstandings of financial knowledge. We show three representative error cases in Figure 4.

Conclusion and Future Work
This paper introduces FINQA, a new expert-annotated QA dataset that aims to tackle numerical reasoning over real-world financial data. The questions in FINQA pose great challenges for existing models, which must resolve domain-specific knowledge as well as acquire complex numerical reasoning abilities. We propose baseline frameworks and conduct comprehensive experiments and analysis. The results show that current large pre-trained models still fall far behind human expert performance. This encourages future work on developing pre-training tasks for such realistic, complex application domains. We believe FINQA will serve as a valuable resource for the research community.

Ethical Considerations
Data Access and Licensing. We develop FINQA based on the publicly available earnings reports of S&P 500 companies from 1999 to 2019, collected in the FinTabNet dataset (Zheng et al., 2021). The FinTabNet dataset is publicly available under the CDLA-Permissive license, which permits us to create additional annotations on top of the data ("Enhanced Data", §1.5 of CDLA) and publish the annotations ("Publish", §1.9 of CDLA).

Dataset Collection Process and Conditions.
For the annotation of our FINQA dataset on UpWork, we first conduct interviews introducing the task with 4 example questions, paid at $30, so that candidates can try a few examples and become familiar with the task. Then, based on their consent to continue with the large-scale job, we discuss and agree on compensation with the workers before starting. We pay around $2.0 per question, and the hourly rates are discussed and agreed upon by both sides based on each worker's speed. Among all eleven US-based hires, the average hourly rate is $35.0, and the minimum and maximum hourly rates are $20 and $50, respectively. The evaluation tasks follow a similar procedure, and each question is paid at $2.0.

IRB (Institutional Review Board) Approval.
This project is approved by our Institutional Review Board (IRB). The systems trained using our dataset are primarily intended to augment human decision-making in financial analysis, not to replace human experts.