MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering

Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often spanning multiple tables in a relational database or web page. Single-table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness that assess various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single-table QA models adapted to a multi-table QA setting by fine-tuning on three datasets: Spider, Atis and GeoQuery.


Introduction
Question answering (QA) over multiple tables aims to provide exact answers to natural language questions with evidence from one or more tables (Jin et al., 2022). This is in contrast to single-table QA, which has been the focus of tabular QA research to date (Liu et al., 2021; Nan et al., 2021; Zhu et al., 2021; Herzig et al., 2020). Even though groups of related tables are ubiquitous in real-world corpora, such as relational databases or tables in a web page, multi-table QA remains a largely unexplored area. To address this gap, we propose a new task of answering questions over multiple tables. Our multi-table setting also leads to novel question types that are unnatural in a single-table setting. For instance, questions involving operations specific to multiple tables, such as Cartesian products (outer joins, inner joins) and set operations (such as intersect, union, in), are unique to and common in a multi-table scenario. Furthermore, such multi-table operations often result in a tabular answer, which necessitates table generation capabilities of the QA model. Figure 1 depicts an example of a question involving two tables, I would like to know the zip code of trips taken above 200 with humidity below 70, and its associated input tables, Weather and trip. A multi-table QA model is expected to disambiguate records from different tables (the question phrase zip code of trips grounds the column zip_code of Table trip; the question phrase humidity below 70 grounds the column min_humidity of Table Weather), learn associations among inter-table columns (zip_code in both tables) and intra-table columns (min_humidity and zip_code in the Weather table), and finally compute the required operations (intersect, count) and generate the tabular answer.
Recent work on tabular QA can be categorized into two major directions: (i) Semantic parsing-based techniques (Pasupat and Liang, 2015; Zhong et al., 2017; Cai et al., 2022), which have been the dominant approach to answering complex multi-table questions. Such methods transform a natural question into a logical form, which is used to query a relational database to extract the answer. However, these techniques are restricted to relational databases and cannot be applied to tables from other sources such as web tables, tables in text documents, or any non-normalized tables. Additionally, they require expensive and expert human annotations (Yu et al., 2018; Lee et al., 2021) for formulating SQL queries from natural questions. (ii) Modeling the problem as a sequence generation/classification task (Yin et al., 2020; Zhang et al., 2020; Herzig et al., 2020; Zhu et al., 2021; Liu et al., 2021; Cheng et al., 2021b; Nan et al., 2021; Ma et al., 2022; Pal et al., 2022; Jin et al., 2022), where an end-to-end trained neural model is not only responsible for question/query understanding but also for table reasoning. Existing end-to-end neural models are either classification-based (Herzig et al., 2020; Zhu et al., 2021), where the model detects the answer span and classifies one table operator associated with the span, or sequence generation-based (Nan et al., 2021; Zhang et al., 2020; Liu et al., 2021), where the model generates the answer as a span of text in an auto-regressive manner.
Our work focuses on the latter direction of research. We train a neural model to mimic a semantic parser and generate the answer. A clear distinction of our work compared to existing end-to-end models is that our proposed model, MultiTabQA, does not operate in the constrained setting of a single input table, but can accommodate one or more tables in the input along with the associated multi-table operators. Additionally, MultiTabQA performs the task of structured table generation, which imposes structural constraints on the generated output, such as table schemas, alignment of rows and columns, and relationships between column headers and column values. Generating structured tables as output requires table-specific evaluation metrics, which we define and use to evaluate the generated tables.
To effectively train the model, we generate a pre-training dataset with multi-table SQL queries and tabular answers built over an existing semantic parsing dataset, Spider (Yu et al., 2018). Our dataset consists of 132,645 samples comprising SQL queries, associated natural language questions, input tables, and tabular answers. To the best of our knowledge, this is the first work to address the task of multi-table QA with tabular answer generation.

Methodology
We frame multi-table question answering as a sequence-to-sequence task and train an autoregressive transformer encoder-decoder model to generate the tabular answer. Given a question Q consisting of a sequence of k tokens q_1, q_2, ..., q_k and a set of N tables, T_N = {t_1, t_2, ..., t_N}, the goal of the multi-table QA model is to perform chains of operations over T_N, constrained by Q, and generate a table T_out. The model always generates a table, T_out, which can be single-celled for scalar answers, single-rowed or single-columned for list-based answers, and have multiple rows and columns for tabular answers. In all cases, the model also generates column headers, revealing important semantics associated with the generated values.
Training approach. We follow a curriculum learning approach (Bengio et al., 2009) by sequentially increasing the complexity of tasks to train MultiTabQA. The first stage of training is a pre-training step where the training objective is twofold: (i) learn to generate correct tabular answers from SQL, and (ii) understand the associations between related input tables. The final training stage is fine-tuning, where the model learns to understand natural language questions with their inherent ambiguity, in addition to retaining its ability to reason over tables and generate a tabular answer. We discuss the training process in detail in Section 4.

Figure 2: Architecture of the MultiTabQA model. Given a natural language question/SQL query and the associated tables as an input sequence, the seq2seq model performs tabular reasoning and generates a tabular answer. The start of an input table is identified with the keyword <table_name>, which also indicates that the next tokens comprise the table name. col: indicates that the next tokens are table headers. Rows in a table are identified with the keyword row i:, and columns are separated by |.
Model input/output. The input to the model is a sequence comprising the query or the natural language question, followed by a sequence of input tables, each represented by its table name and the corresponding flattened table. As shown in Figure 2, the i-th table is flattened in row-major format and represented as <table_name>: n_1 ... n_2 col: h_1 | ... | h_j | ... row 1: r_1^1 | ... | r_m^1 row 2: ..., where n_1, ..., n_2 is the sequence of table name tokens, h_j is the j-th column header, and r_m^i is the cell in the i-th row and m-th column. The boldface words are keywords specifying the semantics of the next tokens. The output of the model is also a flattened table in row-major format, i.e., [table ans rep], but without a table name. As depicted in Figure 2, the generated table, [table ans rep], is of the same form, starting directly with the column headers: col: h_1 | ... | h_j | ... row 1: ... row 2: ...
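The flattening scheme above can be sketched as follows. This is a minimal illustration of the input linearization as we read it from Figure 2; the function names and table representation are ours, not the authors' released code:

```python
def flatten_table(name, headers, rows):
    # Row-major linearization: "<table_name>: <name> col: h1 | h2 row 1: ... row 2: ..."
    parts = [f"<table_name>: {name}", "col: " + " | ".join(headers)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i}: " + " | ".join(str(c) for c in row))
    return " ".join(parts)

def build_model_input(question, tables):
    # tables: list of (name, headers, rows) triples, concatenated after the question
    return question + " " + " ".join(flatten_table(*t) for t in tables)
```

The same row-major format (minus the table name) describes the decoder's target sequence.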

Dataset
To effectively train a multi-table QA model, the dataset needs to cover three aspects: (i) multi-table context, (ii) tabular answers, and (iii) natural questions. Given the absence of large-scale datasets covering all three aspects, we transform existing semantic parsing and single-table QA datasets to focus on a single aspect before training with samples covering all three aspects.

Single table pre-training dataset
One of the sub-tasks of pre-training is to generate tabular answers. We hypothesize that tuning the model to generate tables may lead to a warm-start and better convergence in a multi-table setting. The queries of the single-table Tapex pre-training dataset (Liu et al., 2021) do not mention a table name and cannot be used to query a real SQL database. We insert a placeholder table name in the queries and the corresponding input tables to extract the tabular answer from the database. Transforming the factoid answers to tables leads to single-celled or single-rowed tables. The modified dataset helps the model to understand simple tables and reason over semi-formal queries to generate simple tables.
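Assuming the single-table samples carry factoid answers as scalars or lists, the conversion to minimal answer tables might look like the sketch below. The header name and the scalar-vs-list convention are our assumptions, not specified by the paper:

```python
def answer_to_table(answer, header="answer"):
    # Wrap a factoid answer into a minimal table: a scalar becomes a
    # single-celled table, a list becomes a single-rowed table.
    # "answer" is a placeholder header of our choosing.
    values = answer if isinstance(answer, list) else [answer]
    return {"header": [header] * len(values), "rows": [values]}
```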

Multi-table pre-training dataset
We develop a multi-table pre-training dataset over the Spider databases. We first adapt the existing samples of Spider for our task. We use the ground-truth SQL queries of Spider as input queries for pre-training over multiple tables. We automatically extract all input table names from the SQL query and retrieve the input tables 2 from the relational database. The query, extracted table names, and retrieved tables are inputs to our multi-table QA model. We extract the answer table by executing the SQL query on the relational database. Answer table headers reveal important semantics of the associated column values, such as the numeric operation (average, sum, etc.), numeric scales (million, thousand, kms, meters, etc.), or entity facets (name, date, etc.). This process generates 3,816 samples comprising query, question, table_names, tables and answer.
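The table-name extraction step can be approximated with a simple pattern over FROM and JOIN clauses. This is a sketch under our assumptions, not the authors' extraction code; queries with subqueries or aliased derived tables would need a proper SQL parser:

```python
import re

def extract_table_names(sql):
    # Collect identifiers that directly follow FROM or JOIN keywords,
    # preserving first-occurrence order and dropping duplicates.
    pattern = r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)"
    names = re.findall(pattern, sql, flags=re.IGNORECASE)
    return list(dict.fromkeys(names))
```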
We further augment the modified Spider dataset with 132,645 synthetic queries. This leads to an augmented multi-table pre-training dataset of 136,461 unique training samples, comprising 3,816 Spider samples and 132,645 synthetic samples. The validation set comprises 536 samples from the Spider validation set, preprocessed as described above to adapt to our task.
Existing work on semantic parsing (Shi et al., 2020; Yu et al., 2021) has utilized hand-crafted templates to generate large-scale corpora of synthetic queries, but is constrained in its coverage, with no multi-table operations (Shi et al., 2020) or no table joins and limited diversity in set operations (Yu et al., 2021). This motivates us to generate our augmented pre-training dataset for multi-table QA using multi-table SQL templates.
Our synthetic queries are generated from 45 manually crafted templates over the Spider databases and hand-crafted rules for operation types. The query templates have placeholders for aggregations, relational operations, table names and headers, which are randomly assigned during the query generation process. For example, to generate multi-table join queries, we instantiate the templates by randomly choosing tables from a database with at least one common header. For set operations, all tables participating in a multi-table query require all table headers to match. We design SQL templates in increasing order of complexity, starting with simple SQL templates and progressively adding components that increase their complexity.

Quality control. We ensure correctness of the synthetic samples by discarding SQL queries that execute to an error or an empty table. We also apply this process to the modified Spider, Atis and GeoQuery data to discard SQL queries and the corresponding natural language questions, ensuring that all questions are answerable.
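As an illustration, one hypothetical join template in the spirit of those described above; the template text, helper function, and aggregation list are ours, not taken from the released templates:

```python
import random

# Placeholders {agg}, {t1}, {t2}, {col}, {key} are filled at generation time.
JOIN_TEMPLATE = ("SELECT {agg}({t1}.{col}) FROM {t1} "
                 "JOIN {t2} ON {t1}.{key} = {t2}.{key}")

def instantiate(t1, t2, shared_key, col, rng=random):
    # t1 and t2 are assumed to share the header `shared_key`,
    # mirroring the common-header constraint for join templates.
    agg = rng.choice(["count", "sum", "avg", "min", "max"])
    return JOIN_TEMPLATE.format(agg=agg, t1=t1, t2=t2, col=col, key=shared_key)
```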

Multi-table QA dataset
We fine-tune and evaluate our model on the natural language questions of semantic parsing datasets: Spider, GeoQuery (Zelle and Mooney, 1996), and Atis (Price, 1990; Dahl et al., 1994). GeoQuery is a semantic parsing dataset for querying a database of United States geography. 3 Atis is a semantic parsing dataset 4 with a collection of 4,379 questions, corresponding SQL queries and a relational database for a flight booking system (Iyer et al., 2017). Similar to the Spider dataset processing described in Section 3.2, we first extract the input table names from the available SQL queries and query the relational database for the input tables. 5 We also extract the tabular answers using the SQL queries. We discard any samples that execute to an error or an empty table. We use the corresponding natural language question for each SQL query as the user utterance for fine-tuning. This results in 6,715 training samples and 985 validation samples for Spider. We also process the 600 GeoQuery samples provided by Iyer et al. (2017) to create a subset of 530 training samples, 49 validation samples and 253 test samples. We process and generate an Atis subset of 384 training samples, 45 evaluation samples and 86 test samples. We discard Atis queries with very large input tables (with >10,000 rows). This restriction enables us to correctly evaluate the question answering capabilities of a model by ignoring samples with truncated input sequences, which drop entire input tables from the second table onward. Truncation of tables leads to incorrect answers for any numeric operation such as average or intersect, and the evaluation scores would no longer reflect the reasoning capabilities of the model.

Figure 3: Four stage training procedure. The first three stages are pre-training, followed by fine-tuning.
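The discard-on-error/empty filtering used above can be sketched against a SQLite database as follows; this is a sketch, not the authors' pipeline code:

```python
import sqlite3

def execute_or_discard(conn, sql):
    # Run the query; keep the sample only if it executes without error
    # and returns a non-empty answer table. Column headers come from
    # the cursor description.
    try:
        cur = conn.execute(sql)
        rows = cur.fetchall()
    except sqlite3.Error:
        return None  # discard: query errors out
    if not rows:
        return None  # discard: empty answer table
    headers = [d[0] for d in cur.description]
    return {"header": headers, "rows": rows}
```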

Training
We follow a curriculum learning approach by sequentially training the model on sub-tasks of increasing complexity, as depicted in Figure 3. Broadly, we first pre-train the seq2seq model to mimic a SQL parser and then fine-tune it on the downstream multi-table QA task.

Fine-tuning
The final stage of training is fine-tuning the pre-trained model on natural language questions. Natural questions are ambiguous compared to formal SQL and are therefore used at the last stage of training. We fine-tune the pre-trained model on the 6,715 natural questions with extracted input and output tables for Spider, as described in Section 3, and evaluate on the 985 samples of the validation set. To observe the performance of the pre-trained model on out-of-domain database tables, we also fine-tune the pre-trained model on the Atis and GeoQuery datasets. For all the fine-tuning datasets, we train for 60 epochs.

Evaluation metrics
While denotation accuracy has been widely used in semantic parsing (Pasupat and Liang, 2015; Zhong et al., 2017; Cai et al., 2022), it is not directly applicable to our task, where tabular input encoding, reasoning, and generation are performed by the same model. Evaluating the answer table requires matching not only the generated values but also the table structure. Moreover, tables store factual information such as named entities, dates, and numbers in an ordered manner. This makes lexical metrics, which measure surface-form overlap, more suitable than semantic metrics, which measure the underlying meaning of paraphrased sequences. Moreover, table components such as rows, columns and cells should not be partially matched: a cell should not be partially matched with the predictions "United Nation", "United" or "Kingdom". Similarly, a numeric value such as "123.45" should not be partially matched with "12.45", "23.45" or "12". Numeracy poses a challenge to seq2seq models (Nogueira et al., 2021; Pal and Baral, 2021), especially in the extrapolation setting, where a semantic match of unseen numbers may not be ideal. Considering all these factors, we focus on lexical match to measure model effectiveness.

Row exact match. We define a correctly generated row to be a predicted row that exactly matches any target row in the target table.

Row exact match precision is the percentage of correctly generated rows among all the predicted rows in the evaluation dataset. Row exact match recall is the percentage of correctly generated rows among all the target rows in the evaluation dataset.
Column exact match. Unlike rows, which represent records in relational databases, columns represent attributes, where the column header provides semantic meaning to the values. Hence, a correct column is defined as a generated column whose header exactly matches a target column header and whose values further match the target column values. Column exact match measures the ordering of values within a column. Column exact match precision is the percentage of correctly generated columns among all generated columns in the evaluation set. Column exact match recall is the percentage of correctly generated columns among all target columns in the evaluation set.
Cell exact match. Cell exact match is the most relaxed measure of model efficacy at the lowest level of granularity (cells) where table structure is not measured. A cell is correct if it matches any cell in the corresponding target table. Cell exact match precision is the percentage of correctly predicted cells among all predicted cells in the dataset. Cell exact match recall is the percentage of correctly predicted cells among all target cells in the dataset.
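Under our reading of the definitions above, the row- and column-level precision/recall can be sketched per answer table as follows. Note two assumptions of ours: the paper aggregates counts over the whole evaluation set rather than per table, and matching duplicates as a multiset (each target row can be claimed at most as often as it occurs) is our choice; the cell-level metric follows the same pattern over individual cell values:

```python
from collections import Counter

def row_exact_match(pred_rows, target_rows):
    # A predicted row is correct if it exactly matches a target row.
    pred, target = Counter(map(tuple, pred_rows)), Counter(map(tuple, target_rows))
    correct = sum(min(pred[r], target[r]) for r in pred)
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(target.values()) if target else 0.0
    return precision, recall

def column_exact_match(pred_cols, target_cols):
    # Each column is a (header, ordered tuple of values) pair; both the
    # header and the value ordering must match exactly.
    pred, target = Counter(pred_cols), Counter(target_cols)
    correct = sum(min(pred[c], target[c]) for c in pred)
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(target.values()) if target else 0.0
    return precision, recall
```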

Experimental setup and results
We use tapex-base (Liu et al., 2021) as the base model for all our experiments. tapex-base is a single-table question answering model (140M parameters) trained to approximate table reasoning by pre-training to mimic a SQL parser. For both the pre-training and fine-tuning processes, we use a batch size of 8 with gradient accumulation of 32 to emulate an effective batch size of 256, and a learning rate of 1e-9. The maximum sequence length of both encoder and decoder is set to 1024. We run all our pre-training experiments on four A6000 48GB GPUs and fine-tuning on one A6000 GPU. We observe from Figure 4 that the three-stage pre-training leads to a warm-start for fine-tuning and better convergence compared to the baseline tapex-base. For our experiments, we compare the effectiveness of the MultiTabQA model with tapex-base fine-tuned on the 6,715 natural questions from Spider. The fine-tuned tapex-base acts as a baseline for studying the adaptability of a state-of-the-art single-table model to a multi-table setting. We report the mean scores of 5 training runs initialized with different seeds in Table 1. We conduct statistical significance tests (t-test) on the mean scores of the 5 runs and report significance at p < 0.05 and p < 0.005. We observe that our multi-stage training process leads to improvements in table exact match accuracy across all datasets compared to fine-tuned tapex-base. The difference in table exact match is highest for GeoQuery, where MultiTabQA outperforms tapex-base by 12.38%, followed by Spider at 6.20% and Atis at 1.68%. For F1 and recall scores on row, column and cell exact match, a similar pattern is observed, with MultiTabQA outperforming tapex-base on all datasets. MultiTabQA outperforms tapex-base by 5.43% on row F1, 7.24% on column F1, and 4.52% on cell F1 for Spider. On GeoQuery, MultiTabQA outperforms by 16.49% on row F1, 12.54% on column F1 and 16.66% on cell F1 scores.
All results on Spider and GeoQuery are significant with a p-value below the critical value of 0.05, indicating strong evidence that MultiTabQA is the superior model. On Atis, we observe that MultiTabQA underperforms on precision but outperforms on recall by a large margin. The difference in recall is larger than in precision, indicating that MultiTabQA generates more of the target rows, columns and cells of Atis correctly (higher recall) while hallucinating spurious rows and cells (lower precision). However, the F1 scores are better for MultiTabQA. tapex-base is unable to correctly generate target rows, cells and columns (lower recall), but the few generated ones are correct (higher precision). The low number of test samples (85) of Atis and the variation in hallucinations across runs make the precision scores statistically non-significant. However, the recall scores provide very strong evidence (p < 0.005) of the superiority of MultiTabQA in generating correct rows, columns and cells.

Impact of the number of input tables. The number of input tables increases the complexity of the questions and directly impacts the effectiveness of the models. We segregate the evaluation on the Spider validation set by the number of input tables and compare the results to study this impact. We observe from Figure 5 that effectiveness reduces as the number of tables increases for both MultiTabQA and tapex-base. However, MultiTabQA fares better than tapex-base as the number of input tables increases. MultiTabQA generates whole tables, rows, columns and cells better than tapex-base, as observed in Figures 5a, 5b, 5c and 5d. The gain of MultiTabQA in table exact match for a one-table context is around 8.81%, for a two-table context around 4.37%, and it performs similarly to tapex-base for a three-table context. It also has significantly higher scores on rows, columns and cells, in both single- and multi-table contexts.
We also observe that while the column and table EM decrease dramatically when using several tables (Figures 5a and 5c), the row and cell EM do not (Figures 5b and 5d). This indicates that MultiTabQA can generate rows and cells as effectively in single- and multiple-input-table settings, but fails to do so for columns, and consequently for the whole table. This is due to the fact that certain columns in the answer, particularly ones with numbers such as floats, are challenging to generate. The error from the incorrect columns propagates and accumulates in the table EM, leading to a significant drop in performance for multi-table queries.
Ablation on training stages. We perform an ablation on the pre-training stages to analyse the contribution of each dataset. The simplest setting is to pre-train with Spider SQL queries only, i.e., Stage 2. To evaluate the effectiveness of the single-table Tapex pre-training samples, the next setting comprises Stages 1 and 2, i.e., pre-training with the Tapex pre-training and Spider SQL datasets. The final comparison is with the three-stage pre-training as described in Section 4.1. The results for one run of the experiments are displayed in Table 2. We observe that table exact match is highest for both pre-training and fine-tuning with the three-stage training. Stage 2 fares better than Stage 1+2 on table exact match, and generally has better precision and F1 scores but lower recall. The three-stage pre-training with our synthetic data augmented with Spider outperforms the other settings and confirms the effectiveness of our synthetic data samples in boosting model efficacy.

Conclusion
In this work, we propose a new task of multi-table question answering without intermediate logical forms, filling a gap in existing end-to-end table QA research, which has focused only on single-table QA. We release a pre-training dataset of 132,645 samples to effectively train a seq2seq model. We fine-tune and evaluate our model, MultiTabQA, on the natural language questions of three datasets: Spider, GeoQuery and Atis, to test its efficacy in a multi-table setting. As many multi-table questions result in tables, we train the model to generate tables. This necessitates table-specific metrics at various levels of granularity, which we design to evaluate the effectiveness of our model. We demonstrate that such metrics are insightful for understanding model behavior. MultiTabQA outperforms an existing state-of-the-art single-table QA model fine-tuned to adapt to a multi-table QA setting.

Limitations
Our synthetic pre-training dataset was automatically generated from manual templates, which, despite the scalability and low cost of dataset creation, may limit the diversity of the generated SQL queries. Our model, MultiTabQA, requires improvement in numeracy understanding and numeric operations. Real numbers are especially challenging, and the model may not generate all the digits of a number correctly, rendering the generated cell incorrect. Furthermore, large input tables pose a challenge, as the input sequence may get truncated beyond the model's maximum sequence length. This places practical limitations on the size and number of input tables the model can accommodate before truncation, which leads to incorrect answers.

Ethical Considerations
The task and model proposed in this paper are aimed at broadening the scope of tabular QA research. All the datasets used in this research, apart from our synthetic data, are publicly available in peer-reviewed articles and referenced in this paper. The synthetic SQL dataset we release was generated over a standard benchmark database which has been annotated by 11 Yale students, as mentioned in the original paper. Our synthetic samples use templates annotated by the authors of this work and do not use any user-specific data or information. We will provide open access to our datasets for use in future research under the MIT License. All datasets, including the synthetic pre-training dataset and all datasets adapted for multi-table QA, will be released. Our model is built over tapex-base, which in turn has been trained over bart-base. Our work did not explicitly handle any bias which exists in the aforementioned pre-trained models.