TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Hybrid data combining both tabular and textual content (e.g., financial reports) are pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Textual data, named TAT-QA, where numerical reasoning is usually required to infer the answer, such as addition, subtraction, multiplication, division, counting, comparison/sorting, and their compositions. We further propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text. It adopts sequence tagging to extract relevant cells from the table along with relevant spans from the text to infer their semantics, and then applies symbolic reasoning over them with a set of aggregation operators to arrive at the final answer. According to our experiments on TAT-QA, TAGOP achieves 58.0% in F1, an 11.1% absolute increase over the previous best baseline model. This result still lags far behind the performance of human experts, i.e. 90.8% in F1. This demonstrates that TAT-QA is very challenging and can serve as a benchmark for training and testing powerful QA models that address hybrid data.

In the real world, a more common form of hybrid data is one in which the table (which usually contains numbers) is more comprehensively linked to the text, e.g., semantically related or complementary. Such hybrid data are pervasive in scenarios such as scientific research papers, medical reports and financial reports. The left box of Figure 1 shows a real example from a financial report, consisting of a table with row/column headers and numbers, together with some paragraphs describing it. We call hybrid data like this example the hybrid context in QA problems, as it contains both tabular and textual content, and we call the paragraphs the associated paragraphs of the table. Comprehending and answering a question over such a hybrid context relies on the close relation between the table and the paragraphs, and usually requires numerical reasoning. For example, one needs to identify "revenue from the external customers" in the describing text in order to understand the content of the table. As for "How much does the commercial cloud revenue account for the total revenue in 2019?", one needs to get the total revenue in 2019, i.e. "125,843 million", from the table and the commercial cloud revenue, i.e. "38.1 billion", from the text to infer the answer.
To stimulate progress of QA research over such hybrid data, we propose a new dataset, named TAT-QA (Tabular And Textual dataset for Question Answering). The hybrid contexts in TAT-QA are extracted from real-world financial reports, each composed of a table with row/column headers and numbers, as well as at least two paragraphs that describe, analyse or complement the content of this table. Given hybrid contexts, we invite annotators with financial knowledge to generate questions that are useful in real-world financial analyses and to provide answers accordingly. It is worth mentioning that a large portion of questions in TAT-QA demand numerical reasoning, for which the derivation of the answer is also labeled to facilitate developing explainable models. In total, TAT-QA contains 16,552 questions associated with 2,757 hybrid contexts from 182 reports.
[Figure 1, associated paragraphs (excerpt): "Revenue from external customers, classified by significant product and service offerings, was as follows: Our commercial cloud revenue, which includes Office 365 Commercial, Azure, the commercial portion of LinkedIn, Dynamics 365, and other commercial cloud properties, was $38.1 billion, $26.6 billion and $16.2 billion in fiscal years 2019, 2018, and 2017, respectively. These amounts are primarily included in Office products and cloud services, Server products and cloud services, and LinkedIn in the table above."]
We further propose a novel TAGOP model based on TAT-QA. Taking as input the given question, table and associated paragraphs, TAGOP applies sequence tagging to extract relevant cells from the table and relevant spans from the text as evidence. It then applies symbolic reasoning over them with a set of aggregation operators to arrive at the final answer. Predicting the magnitude of a number, such as thousand, million or billion, is an important aspect of tackling hybrid data in TAT-QA, because the magnitude is often omitted or shown only in the headers or associated paragraphs of the table for brevity. We term this magnitude of a number its scale. Take Question 6 in Figure 1 as an example: "How much of the total revenue in 2018 did not come from devices?" The numerical value in the answer is obtained by subtraction, "110,360 - 5,134", while the scale "million" is identified from the first-row header of the table. In TAGOP, we incorporate a multi-class classifier for scale prediction.
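The two questions quoted so far can be worked through by hand. The following is a minimal sketch of that arithmetic in Python; the variable names are illustrative, and converting the text value to millions before dividing is an assumption about how one would align the two scales, not a description of TAGOP itself.

```python
# Question 6: both operands come from the table, whose header gives the scale.
q6_number = 110_360 - 5_134
q6_scale = "million"

# Question 7: the table value is in millions, the text value in billions,
# so both are brought to millions before dividing (assumed alignment step).
total_revenue_millions = 125_843          # from the table
cloud_revenue_millions = 38.1 * 1_000     # "38.1 billion" from the text
q7_share = cloud_revenue_millions / total_revenue_millions

print(f"Q6: {q6_number:,} {q6_scale}")    # Q6: 105,226 million
print(f"Q7: {q7_share:.1%}")              # Q7: 30.3%
```

The second case shows why the scale has to be tracked explicitly: dividing 38.1 by 125,843 without aligning the scales would give a meaningless ratio.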
We test three types of QA models on TAT-QA, which specifically address tabular, textual, and hybrid data. According to our experiments on TAT-QA, TAGOP achieves 58.0% in terms of F1, an 11.1% absolute increase over the best baseline model. It is worth noting that the results still lag far behind the performance of human experts, i.e. 90.8% in F1. Tackling the QA task over hybrid data as in TAT-QA is thus challenging, and more effort is demanded. We expect our TAT-QA dataset and TAGOP model to serve as a benchmark and baseline respectively, contributing to the development of QA models for hybrid data, especially those requiring numerical reasoning.

Dataset Construction and Analysis
We here explain how we construct TAT-QA and analyze its statistics to better reveal its properties.

Data Collection and Preprocessing
In TAT-QA there are two forms of data: tables and their relevant text, which are extracted from real-world financial reports.
In particular, we first download about 500 financial reports released in the past two years from an online website. We adopt a table detection model to detect tables in these reports, and apply the Apache PDFBox library to extract the table contents to be processed with our annotation tool. We only keep tables with 3 to 30 rows and 3 to 6 columns. Finally, about 20,000 candidate tables are retained, which have no standard schema and contain many numbers.

The corresponding reports with the selected tables are also kept. Note that these candidate tables may still contain errors, such as too few or too many rows/columns or mis-detected numbers, which are manually picked out and deleted or fixed during the annotation process.
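The size filter on candidate tables can be sketched as follows. This is a hypothetical helper, assuming a table has already been extracted as a list of rows of cell strings and that the column count is taken from the widest row; neither detail is specified in the text.

```python
def keep_table(table: list[list[str]]) -> bool:
    """Keep tables with 3 to 30 rows and 3 to 6 columns (bounds inclusive)."""
    n_rows = len(table)
    # Assumption: the widest row defines the number of columns.
    n_cols = max((len(row) for row in table), default=0)
    return 3 <= n_rows <= 30 and 3 <= n_cols <= 6


candidates = [
    [["", "2019", "2018"]] * 4,   # 4 rows x 3 columns: kept
    [["", "2019"]] * 4,           # only 2 columns: dropped
    [["x"] * 4] * 40,             # 40 rows: dropped
]
kept = [t for t in candidates if keep_table(t)]
```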

Dataset Annotation
The annotation is done with our self-developed tool. All the annotators have financial background knowledge.
Adding Relevant Paragraphs to Tables. We build valid hybrid contexts based on the original reports kept in the previous step. A valid hybrid context in TAT-QA consists of a table and at least two associated paragraphs surrounding it, as shown in the left box in Figure 1. To associate enough relevant paragraphs with a candidate table, the annotators first check whether there are ≥ 2 paragraphs around this table, and then check whether they are relevant, meaning the paragraphs should describe, analyse or complement the content of the table. If so, all the surrounding paragraphs are associated with this table. Otherwise, the table is skipped (discarded); about two thirds of the candidate tables were discarded.
Question-Answer Pair Creation. Based on the valid hybrid contexts, the annotators are then asked to create question-answer pairs, where the questions need to be useful in real-world financial analyses. In addition, we encourage them to create questions that can be answered by people without much finance knowledge and to use common words instead of the same words that appear in the hybrid context (Rajpurkar et al., 2016). Given one hybrid context, at least 6 questions are generated, including extracted and calculated questions. For extracted questions, the answers can be a single span or multiple spans from either the table or the associated paragraphs. For calculated questions, numerical reasoning is required to produce the answers, including addition, subtraction, multiplication, division, counting, comparison/sorting and their compositions. Furthermore, we particularly ask the annotators to annotate the right scale for the numerical answer when necessary.
Answer Type and Derivation Annotation. The answers in TAT-QA have three types: a single span or multiple spans extracted from the table or text, as well as a generated answer (usually obtained through numerical reasoning). The annotators also need to label the type of an answer after they generate it. For generated answers, the corresponding derivations are provided to facilitate the development of explainable QA models, including two types: 1) an arithmetic expression, like (11,386 - 10,353) / 10,353 for Question 8 in Figure 1, which can be executed to arrive at the final answer; and 2) a set of items separated with "##", like "device ## enterprise services" for Question 4 in Figure 1, where the count of items equals the answer. We further divide questions in TAT-QA into four kinds: Span, Spans, Arithmetic and Counting, where the latter two kinds correspond to the above two types of derivations, to help us better investigate the numerical reasoning capability of a QA model.
Answer Source Annotation. For each answer, annotators are required to specify the source(s) it is derived from, including Table, Text, and Table-text (both). This forces the model to learn to aggregate information from hybrid sources to infer the answer, thus improving its generalizability. For example, to answer Question 7 in Figure 1, "How much does the commercial cloud revenue account for the total revenue in 2019?", we can observe from the derivation that "125,843 million" comes from the table while "38.1 billion" comes from the text.
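The two derivation formats lend themselves to a small interpreter. The sketch below is an assumption about how such derivations could be executed, not the authors' tooling: `run_derivation` is a hypothetical name, thousands separators and spaces are stripped before parsing, and a derivation is treated as a counting one only if it contains "##" (in practice the annotated question kind would disambiguate single-item lists).

```python
import ast
import operator

# Only the four basic arithmetic operators are allowed in a derivation.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported derivation")


def run_derivation(derivation: str):
    """Return the item count for '##' lists, else the arithmetic result."""
    if "##" in derivation:
        return len([item for item in derivation.split("##") if item.strip()])
    cleaned = derivation.replace(",", "").replace(" ", "")
    return _eval(ast.parse(cleaned, mode="eval"))


count_answer = run_derivation("device ## enterprise services")   # 2
ratio_answer = run_derivation("(11,386 - 10,353) / 10,353")      # about 0.0998
```

Walking the syntax tree instead of calling `eval` keeps the interpreter restricted to numbers and arithmetic, which is enough for the derivations described here.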

Quality Control
To ensure the quality of annotation in TAT-QA, we apply strict quality control procedures.
Competent Annotators. Building TAT-QA requires financial domain knowledge. Hence, we employ about 30 university students majoring in finance or similar disciplines as annotators. We give all candidate annotators a short test and only those with a 95% correct rate are hired. Before starting the annotation work, we give the annotators a training session to help them fully understand our annotation requirements and learn how to use our annotation system.
Two-round Validation. For each annotation, we ask two different verifiers to perform a two-round validation after it is submitted, consisting of checking and approval, to ensure its quality. We have five verifiers in total, including two annotators who have performed well on this project and three graduate students with a financial background. In the checking phase, a verifier checks the submitted annotation and asks the annotator to fix it if any mistake or problem is found. In the approval phase, a different verifier inspects the annotation that has been confirmed by the first verifier, and approves it if no problem is found.

Dataset Analysis
On average, an annotator can label two hybrid contexts per hour; the whole annotation work lasted about three months. Finally, we obtain a total of 2,757 hybrid contexts and 16,552 corresponding question-answer pairs from 182 financial reports. The hybrid contexts are randomly split into a training set (80%), a development set (10%) and a test set (10%); hence all questions about a particular hybrid context belong to only one of the splits. We show the basic statistics of each split in Table 1, and the question distribution regarding answer source and answer type in Table 2. In Figure 1, we give an example from TAT-QA, demonstrating the various reasoning types and the percentage of each reasoning type over the whole dataset.
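Splitting at the level of hybrid contexts rather than questions can be sketched as below. The function name, the seed and the rounding of split sizes are assumptions of this sketch, so it will not reproduce the exact split sizes reported in Table 1.

```python
import random


def split_contexts(context_ids, seed=0):
    """Randomly split context ids 80/10/10; questions follow their context."""
    ids = sorted(context_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_dev = int(0.1 * len(ids))
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }


splits = split_contexts(range(2757))
# Every question inherits the split of the hybrid context it belongs to.
context_to_split = {cid: name for name, ids in splits.items() for cid in ids}
```

Because the lookup is keyed by context id, no hybrid context (and therefore no table) can leak between the training, development and test sets.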

TAGOP Model
We introduce a novel QA model, named TAGOP, which first applies sequence TAGging to extract relevant cells from the table and text spans from the paragraphs, inspired by Li et al. (2016), Sun et al. (2016) and Segal et al. (2020). This step is analogous to slot filling or schema linking, whose effectiveness has been demonstrated in dialogue systems and semantic parsing. TAGOP then performs symbolic reasoning over them with a set of aggregation OPerators to arrive at the final answer. The overall architecture is illustrated in Figure 2.

Sequence Tagging
Given a question, TAGOP first extracts supporting evidence from its hybrid context (i.e. the table and associated paragraphs) via sequence tagging with the Inside-Outside (IO) tagging approach (Ramshaw and Marcus, 1995). In particular, it assigns each token either an I or an O label and takes those tagged I as positive. The question, the table flattened by row (Herzig et al., 2020) and the associated paragraphs are input sequentially to a transformer-based encoder like RoBERTa, as shown in the bottom part of Figure 2, to obtain the corresponding representations. Each sub-token is tagged independently, and the corresponding cell in the table or word in the paragraph is regarded as positive if any of its sub-tokens is tagged with I. For the paragraphs, consecutive words that are predicted as positive are combined into a span. During testing, all positive cells and spans are taken as the supporting evidence. Formally, for each sub-token t in the paragraph, the probability of the tag is computed as
$$\mathbf{p}^{\mathrm{tag}}_t = \mathrm{softmax}(\mathrm{FFN}(h_t)) \tag{1}$$
where FFN is a two-layer feed-forward network with GELU (Hendrycks and Gimpel, 2016) activation and $h_t$ is the representation of sub-token t.
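The step from sub-token tags to paragraph spans can be illustrated with a short sketch. It assumes the sub-token tags have already been grouped per word (the grouping and the function name are hypothetical); a word is positive if any of its sub-tokens is tagged I, and consecutive positive words are merged into one span.

```python
def words_to_spans(words, subtoken_tags):
    """Merge consecutive positive words into spans.

    words: list of paragraph words.
    subtoken_tags: one list of "I"/"O" tags per word, one tag per sub-token.
    """
    spans, current = [], []
    for word, tags in zip(words, subtoken_tags):
        if "I" in tags:            # positive if any sub-token is tagged I
            current.append(word)
        elif current:              # a negative word closes the open span
            spans.append(" ".join(current))
            current = []
    if current:                    # flush a span that ends the paragraph
        spans.append(" ".join(current))
    return spans


words = ["revenue", "was", "$38.1", "billion", "in", "2019"]
tags = [["O"], ["O"], ["O", "I", "I"], ["I"], ["O"], ["I", "O"]]
spans = words_to_spans(words, tags)   # ["$38.1 billion", "2019"]
```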

Aggregation Operator
Next, we perform symbolic reasoning over the obtained evidence to infer the final answer, for which we apply an aggregation operator. In TAGOP, there are ten types of aggregation operators. For each input question, an operator classifier decides which operator the evidence goes through; for operators that are sensitive to the order of the input numbers, an auxiliary number order classifier is used. The aggregation operators are described below, covering most reasoning types listed in Figure 1.
• Span-in-text: To select the span with the highest probability from predicted candidate spans. The probability of a span is the highest probability of all its sub-tokens tagged I.
• Cell-in-table: To select the cell with the highest probability from predicted candidate cells. The probability of a cell is the highest probability of all its sub-tokens tagged I.
• Spans: To select all the predicted cell and span candidates.
• Sum: To sum all predicted cells and spans purely consisting of numbers.
• Count: To count all predicted cells and spans.
• Average: To average over all the predicted cells and spans purely consisting of numbers.
• Multiplication: To multiply all predicted cells and spans purely consisting of numbers.
• Division: To first rank all the predicted cells and spans purely consisting of numbers based on their probabilities, and then apply division to the top two.
• Difference: To first rank all predicted numerical cells and spans based on their probabilities, and then apply subtraction to the top two.
• Change ratio: For the top two values after ranking all predicted numerical cells and spans based on their probabilities, to compute the change ratio of the first value compared to the second one.
[Figure 2: Architecture of TAGOP, illustrated on a question from Figure 1, where the hybrid context is also shown. TAGOP supports 10 operators, which are described in Section 3.2.]
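The operators listed above can be sketched in a few lines of Python. This is an illustrative reading, not the released implementation: evidence is assumed to be a list of `(text, probability)` pairs, "purely consisting of numbers" is approximated by a simple parser that strips commas and dollar signs, the `swap` flag stands in for the number order classifier, and the change ratio is assumed to be `(first - second) / second`.

```python
from math import prod


def parse_number(text):
    """Return the float value of a purely numeric string, else None."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def ranked_numbers(evidence):
    """Numeric evidence values, sorted by descending probability."""
    pairs = [(parse_number(text), prob) for text, prob in evidence]
    pairs = [(value, prob) for value, prob in pairs if value is not None]
    return [v for v, _ in sorted(pairs, key=lambda vp: vp[1], reverse=True)]


def aggregate(op, evidence, swap=False):
    """Apply one aggregation operator to (text, probability) evidence."""
    if op in ("span_in_text", "cell_in_table"):
        # Assumes evidence was already restricted to spans or to cells.
        return max(evidence, key=lambda tp: tp[1])[0]
    if op == "spans":
        return [text for text, _ in evidence]
    if op == "count":
        return len(evidence)
    numbers = ranked_numbers(evidence)
    if op == "sum":
        return sum(numbers)
    if op == "average":
        return sum(numbers) / len(numbers)
    if op == "multiplication":
        return prod(numbers)
    if op not in ("difference", "division", "change_ratio"):
        raise ValueError(f"unknown operator: {op}")
    # Order-sensitive operators use the two most probable numbers.
    first, second = numbers[0], numbers[1]
    if swap:
        first, second = second, first
    if op == "difference":
        return first - second
    if op == "division":
        return first / second
    return (first - second) / second   # change ratio (assumed definition)


evidence = [("110,360", 0.9), ("5,134", 0.8), ("devices", 0.4)]
difference = aggregate("difference", evidence)   # 105226.0
```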
Operator Classifier. To predict the right aggregation operator, a multi-class classifier is developed. In particular, we take the vector of [CLS] as input to compute the probability:
$$\mathbf{p}^{\mathrm{op}} = \mathrm{softmax}(\mathrm{FFN}([\mathrm{CLS}])) \tag{2}$$
where FFN denotes a two-layer feed-forward network with the GELU activation.
Number Order Classifier. For the Difference, Division and Change ratio operators, the order of the two input numbers matters for the final result. Hence we additionally append a number order classifier after them, formulated as
$$\mathbf{p}^{\mathrm{order}} = \mathrm{softmax}(\mathrm{FFN}(\mathrm{avg}(h_{t1}, h_{t2}))) \tag{3}$$
where FFN denotes a two-layer feed-forward network with the GELU activation, $h_{t1}$ and $h_{t2}$ are the representations of the top two tokens according to probability, and "avg" means average. For a token, its probability is the highest probability of all its sub-tokens tagged I, and its representation is the average over those of its sub-tokens.
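The way token-level quantities are derived from sub-tokens can be made concrete with plain Python lists. The helper names are hypothetical, and averaging over all sub-tokens of a token (rather than only those tagged I) is an assumption where the text is ambiguous; the FFN and softmax themselves are omitted.

```python
def token_probability(subtoken_probs, subtoken_tags):
    """Highest probability among the sub-tokens tagged I (0.0 if none)."""
    tagged = [p for p, tag in zip(subtoken_probs, subtoken_tags) if tag == "I"]
    return max(tagged) if tagged else 0.0


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def order_classifier_input(subtokens_t1, subtokens_t2):
    """avg(h_t1, h_t2): mean of the two token representations."""
    h_t1 = average_vectors(subtokens_t1)   # token = mean of its sub-tokens
    h_t2 = average_vectors(subtokens_t2)
    return average_vectors([h_t1, h_t2])


prob = token_probability([0.2, 0.9, 0.6], ["O", "I", "I"])                # 0.9
features = order_classifier_input([[0.0, 2.0], [2.0, 4.0]], [[3.0, 5.0]])  # [2.0, 4.0]
```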

Scale Prediction
So far we have obtained the string or numerical value to be contained in the final answer. However, a correct prediction of a numerical answer should include not only the right number but also the correct scale. This is a unique challenge of TAT-QA and is pervasive in the finance domain. We develop a multi-class classifier to predict the scale. Generally, the scale in TAT-QA may be None, Thousand, Million, Billion, or Percent. Taking as input the concatenation of the [CLS] representation with the table and paragraph representations, the classifier computes the probability of the scale as
$$\mathbf{p}^{\mathrm{scale}} = \mathrm{softmax}(\mathrm{FFN}([[\mathrm{CLS}]; h_{tab}; h_{p}])) \tag{4}$$
where $h_{tab}$ and $h_{p}$ are the representations of the table and the paragraphs respectively, which are obtained by applying average pooling over the representations of their corresponding tokens, ";" denotes concatenation, and FFN denotes a two-layer feed-forward network with the GELU activation. After obtaining the scale, a numerical prediction is multiplied by the corresponding scale, and a string prediction is concatenated with it, to form the final prediction that is compared with the ground-truth answer.
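Combining a prediction with its scale can be sketched as follows. The multiplier table, and in particular treating Percent as a factor of 0.01, is an assumption of this sketch; the text only states that numbers are multiplied by the scale and strings are concatenated with it.

```python
# Assumed multipliers for the five scale classes named in the text.
SCALE_MULTIPLIER = {
    "none": 1,
    "thousand": 1e3,
    "million": 1e6,
    "billion": 1e9,
    "percent": 0.01,
}


def apply_scale(prediction, scale):
    """Multiply numeric predictions by the scale; append it to strings."""
    if isinstance(prediction, (int, float)):
        return prediction * SCALE_MULTIPLIER[scale]
    return prediction if scale == "none" else f"{prediction} {scale}"


numeric_answer = apply_scale(105226, "million")
string_answer = apply_scale("38.1", "billion")   # "38.1 billion"
```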

Training
To optimize TAGOP, the overall loss is the sum of the losses of the above four classification tasks:
$$\mathcal{L} = \mathrm{NLL}(\log(\mathbf{p}^{\mathrm{tag}}), G^{\mathrm{tag}}) + \mathrm{NLL}(\log(\mathbf{p}^{\mathrm{op}}), G^{\mathrm{op}}) + \mathrm{NLL}(\log(\mathbf{p}^{\mathrm{scale}}), G^{\mathrm{scale}}) + \mathrm{NLL}(\log(\mathbf{p}^{\mathrm{order}}), G^{\mathrm{order}}) \tag{5}$$
where NLL(·) is the negative log-likelihood loss, and $G^{\mathrm{tag}}$ and $G^{\mathrm{op}}$ come from the supporting evidence extracted from the annotated answer and derivation. We locate the evidence in the table first if the table is among the answer sources, and otherwise in the associated paragraphs. Note that we only keep the first occurrence if a piece of evidence appears multiple times in the hybrid context. $G^{\mathrm{scale}}$ uses the annotated scale of the answer. $G^{\mathrm{order}}$ is needed when the ground-truth operator is one of Difference, Division and Change ratio, and is obtained by mapping the two operands extracted from their corresponding ground-truth derivation to the input sequence. If their order is the same as that in the input sequence, $G^{\mathrm{order}} = 0$; otherwise it is 1.
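A framework-free sketch of the summed objective is given below. It works on already-normalized probability lists, sums the tagging loss over sub-tokens (whether the original averages instead is not stated), and skips the order term when no order label is needed; all function and argument names are hypothetical.

```python
import math


def nll(probs, gold_index):
    """Negative log-likelihood of the gold class under a distribution."""
    return -math.log(probs[gold_index])


def total_loss(p_tag_seq, g_tag_seq, p_op, g_op, p_scale, g_scale,
               p_order=None, g_order=None):
    """Sum of the tag, operator, scale and (optional) order losses."""
    loss = sum(nll(p, g) for p, g in zip(p_tag_seq, g_tag_seq))
    loss += nll(p_op, g_op) + nll(p_scale, g_scale)
    if g_order is not None:   # only for Difference, Division, Change ratio
        loss += nll(p_order, g_order)
    return loss


loss = total_loss(
    p_tag_seq=[[0.5, 0.5]], g_tag_seq=[1],
    p_op=[1.0, 0.0], g_op=0,
    p_scale=[1.0], g_scale=0,
)
```

In this toy call only the tagging term contributes, so the loss equals log 2.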

Baselines
Textual QA Models. We adopt two reading comprehension (RC) models as baselines over textual data: BERT-RC (Devlin et al., 2018), which is a SQuAD-style RC model; and NumNet+ V2 (Ran et al., 2019), which achieves promising performance on DROP, a dataset that requires numerical reasoning over textual data. We adapt them to TAT-QA as follows. We convert the table to a sequence by row and feed it to the models, followed by the tokens from the paragraphs. Besides, we add a multi-class classifier, exactly as in TAGOP, to enable the two models to predict the scale based on Eq. (4).
Tabular QA Model. We employ TaPas for WikiTableQuestion (WTQ) (Herzig et al., 2020) as a baseline over tabular data. TaPas is pretrained jointly over large-scale tables and associated text from Wikipedia for table parsing. To train it, we heuristically locate the evidence in the table using the annotated answer or derivation, taking the first match if the same value appears multiple times. In addition, we remove the "numerical rank id" feature in its embedding layer, which ranks all values per numerical column in the table but does not make sense in TAT-QA. As in the textual QA setting above, we add a multi-class classifier to predict the scale as in Eq. (4).
Hybrid QA Model. We adopt HyBrider (Chen et al., 2020b) as our baseline over hybrid data, which tackles tabular and textual data from Wikipedia. We use the code released with the original paper, but adapt it to TAT-QA. Concretely, each cell in a TAT-QA table is regarded as "linked" with the associated paragraphs of this table, like the hyperlinks in the original paper, and we only use its cell matching mechanism to link the question with the table cells in its linking stage. The selected cells and paragraphs are fed into the RC model in the last stage to infer the answer. For ease of training on TAT-QA, we also omit the prediction of the scale, i.e. we regard the scale predicted by this model as always correct.
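The input adaptation for the textual baselines amounts to flattening the table row by row and appending the paragraph tokens. The sketch below assumes whitespace tokenization and places the question first; both are simplifications, since the actual models use their own sub-word tokenizers and special tokens.

```python
def linearize(question, table, paragraphs):
    """Question tokens, then table cells in row-major order, then paragraphs."""
    tokens = question.split()
    for row in table:                 # table is flattened by row
        for cell in row:
            tokens.extend(cell.split())
    for paragraph in paragraphs:
        tokens.extend(paragraph.split())
    return tokens


sequence = linearize(
    "Total revenue ?",
    [["", "2019"], ["Revenue", "125,843"]],
    ["In millions ."],
)
```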

Evaluation Metrics
We adopt the popular Exact Match (EM) and numeracy-focused F1 score (Dua et al., 2019) to measure model performance on TAT-QA. However, the original implementation of both metrics is insensitive to whether a value in the answer is positive or negative, as the minus sign is omitted in evaluation. Since this issue is crucial for correctly interpreting numerical values, especially in the finance domain, we keep the sign of a value when calculating them. In addition, the numeracy-focused F1 score is set to 0 unless the predicted number multiplied by the predicted scale exactly equals the ground truth.
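The effect of keeping the sign and the scale in the comparison can be shown with a small sketch. It is not the official evaluation script: the function names are hypothetical, and `math.isclose` with a tight tolerance is used here as a floating-point stand-in for exact equality.

```python
import math


def numeric_match(pred_number, pred_multiplier, gold_value):
    """True only if number x scale equals the gold value, sign included."""
    return math.isclose(pred_number * pred_multiplier, gold_value,
                        rel_tol=1e-9, abs_tol=1e-12)


def sign_insensitive_match(pred_number, pred_multiplier, gold_value):
    """Behaviour of a metric that drops the minus sign, for contrast."""
    return math.isclose(abs(pred_number * pred_multiplier), abs(gold_value),
                        rel_tol=1e-9, abs_tol=1e-12)


gold = 105_226_000_000
right = numeric_match(105226, 1e6, gold)            # True
wrong_scale = numeric_match(105226, 1e3, gold)      # False
wrong_sign = numeric_match(-1033, 1.0, 1033)        # False
lenient = sign_insensitive_match(-1033, 1.0, 1033)  # True
```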

Results and Analysis
In the following, we report our experimental results on the dev and test sets of TAT-QA.
Comparison with Baselines. We first compare TAGOP with three types of previous QA models as described in Section 4.1. The results are summarized in Table 3. Our model is superior to all baselines in terms of both metrics, with very large margins over the second best, namely 50.1/58.0 vs. 37.0/46.9 in EM/F1 on the test set of TAT-QA. This reveals the effectiveness of our method, which reasons over both tabular and textual data involving many numerical contents. Of the two textual QA baselines, NumNet+ V2 performs better than BERT-RC, which is possibly attributed to the stronger numerical reasoning capability of the former, but it is still worse than our method. The tabular QA baseline TaPas for WTQ is trained with only the tabular data in TAT-QA, and shows very limited capability to process hybrid data, as can be seen from its performance. HyBrider is the worst among all baseline models, because it is designed for HybridQA (Chen et al., 2020b), which focuses neither on the comprehensive interdependence of table and paragraphs nor on numerical reasoning.
However, all the models perform significantly worse than humans 6, indicating that TAT-QA is challenging for current QA models and that more effort on hybrid QA is demanded.
Answer Type and Source Analysis. Furthermore, we analyze the detailed performance of TAGOP w.r.t. answer type and source in Table 4. TAGOP performs better on questions whose answers rely on the tables than on those whose answers come from the text. This is probably because table cells have clearer boundaries than text spans, so it is relatively easy for the model to extract supporting evidence from the tables with sequence tagging. In addition, TAGOP performs relatively worse on arithmetic questions than on other types. This may be because the calculations for arithmetic questions are more diverse and harder than those of other types, which indicates the challenge of TAT-QA, especially its requirement of numerical reasoning.

Results of TAGOP with Different Operators
We here investigate the contributions of the ten aggregation operators to the final performance of TAGOP. As shown in Table 5, we devise nine variants of the full TAGOP model: starting from the variant with only one operator (e.g. Span-in-text), each subsequent variant adds one more operator back. As can be seen from the table, all added operators benefit the model performance. Furthermore, we find that some operators like Span-in-text, Cell-in-table, Difference and Average make [...]
6 The human performance is evaluated by asking annotators to answer 50 randomly sampled hybrid contexts (containing 301 questions) from our test set. Note that the human performance is still not 100% correct because our questions require a relatively heavy cognitive load, such as tedious numerical calculations. Compared with the human F1 performance on SQuAD (Rajpurkar et al., 2016) (86.8%) and DROP (Dua et al., 2019) (96.4%), the score on our dataset (90.8%) indicates good quality and annotation consistency.
[...] Evidence (29%), meaning the model failed to extract the supporting evidence for the answer; (3) Wrong Calculation (9%), meaning the model failed to compute the answer with the correct supporting evidence; (4) Unsupported Calculation (4%), meaning the ten operators defined cannot support this calculation; (5) Scale Error (3%), meaning the model failed to predict the scale of the numerical value in an answer.
We can then observe that about 84% of the errors are caused by the failure to extract the supporting evidence from the table and paragraphs given a question. This demonstrates that more effort is needed to strengthen the model's capability of precisely aggregating information from hybrid contexts.
After instance-level analysis, we find that another interesting source of error is the dependence on domain knowledge. While we encourage annotators to create questions answerable by humans without much finance knowledge, we still find that domain knowledge is required for some questions. For example, given the question "What is the gross profit margin of the company in 2015?", the model needs to extract the gross profit and revenue from the hybrid context and compute the answer according to the finance formula ("gross profit margin = gross profit / revenue"). How to integrate such finance knowledge into QA models to answer questions in TAT-QA still needs further exploration.

Related Work
[...] (Hermann et al., 2015), SQuAD (Rajpurkar et al., 2016), etc. Recently, deep reasoning over textual data has gained increasing attention (Zhu et al., 2021), e.g. multi-hop reasoning (Welbl et al., 2018). DROP (Dua et al., 2019) is built to develop the numerical reasoning capability of QA models, which in this sense is similar to TAT-QA, but it focuses only on textual data. KB/Tabular QA aims to automatically answer questions via a well-structured KB (Berant et al., 2013; Talmor and Berant, 2018; Yih et al., 2015) or semi-structured tables (Pasupat and Liang, 2015; Zhong et al., 2017; Yu et al., 2018). In comparison, QA over hybrid data has received limited attention, focusing on mixtures of KB/tables and text. HybridQA (Chen et al., 2020b) is one existing hybrid dataset for QA tasks, where the context is a table connected with Wiki pages via hyperlinks.
Numerical Reasoning. Numerical reasoning is key to many NLP tasks like question answering (Dua et al., 2019; Ran et al., 2019; Andor et al., 2019; Chen et al., 2020a; Pasupat and Liang, 2015; Herzig et al., 2020; Yin et al., 2020) and arithmetic word problems (Kushman et al., 2014; Mitra and Baral, 2016; Huang et al., 2017; Ling et al., 2017). To the best of our knowledge, no prior work attempts to develop models able to perform numerical reasoning over hybrid contexts.

Conclusion
We propose a new challenging QA dataset, TAT-QA, comprising real-world hybrid contexts in the finance domain, where the table contains numbers and has comprehensive dependencies on the text. Answering questions in TAT-QA requires both exploiting the close relation between table and paragraphs and performing numerical reasoning. We also propose a baseline model, TAGOP, based on TAT-QA, which aggregates information from the hybrid context and performs numerical reasoning over it with pre-defined operators to compute the final answer. Experiments show that the TAT-QA dataset is very challenging and that more effort is demanded for tackling QA tasks over hybrid data. We expect our TAT-QA dataset and TAGOP model to serve as a benchmark and baseline respectively, helping to build more advanced QA models and facilitating the development of QA technologies that address more complex and realistic hybrid data, especially those requiring numerical reasoning.