Localize, Retrieve and Fuse: A Generalized Framework for Free-Form Question Answering over Tables

Question answering on tabular data (a.k.a. TableQA), which aims at generating answers to questions grounded on a provided table, has gained significant attention recently. Prior work primarily produces concise factual responses through information extraction from individual or limited table cells, lacking the ability to reason across diverse table cells. Yet, the realm of free-form TableQA, which demands intricate strategies for selecting relevant table cells and the sophisticated integration and inference of discrete data fragments, remains mostly unexplored. To this end, this paper proposes a generalized three-stage approach: Table-to-Graph conversion and cell localizing, external knowledge retrieval, and the fusion of table and text (called TAG-QA), to address the challenge of inferring long free-form answers in generative TableQA. In particular, TAG-QA (1) locates relevant table cells using a graph neural network to gather intersecting cells between relevant rows and columns, (2) leverages external knowledge from Wikipedia, and (3) generates answers by integrating both tabular data and natural linguistic information. Experiments showcase the superior capabilities of TAG-QA in generating sentences that are both faithful and coherent, particularly when compared to several state-of-the-art baselines. Notably, TAG-QA surpasses the robust pipeline-based baseline TAPAS by 17% and 14% in terms of BLEU-4 and PARENT F-score, respectively. Furthermore, TAG-QA outperforms the end-to-end model T5 by 16% and 12% on BLEU-4 and PARENT F-score, respectively.


Introduction
Question answering aims to generate precise answers by interacting efficiently with unstructured, structured, or heterogeneous contexts, such as paragraphs, knowledge bases, tables, images, and various combinations thereof (Burke et al., 1997; Yao and Van Durme, 2014; Talmor et al., 2021; Hao et al., 2017).1 Among these, question answering on tabular data (TableQA) is a challenging task that requires the understanding of table semantics, as well as the ability to reason and infer over relevant table cells (Herzig et al., 2021; Chen et al., 2020b, 2021b).

[Q]: Who won in the first three places of The Newcomers Manx Grand Prix race?
[TAPAS]: The Newcomers Manx Grand Prix race was won by the Spaniard in the first three places with a time of 1:28.22.2 seconds, 1:29.24.8 seconds and a time of 1:29.59.2.
[MATE]: Robert Dunlop won in the first three places, followed by Steve Hislop in the second.
[Ours]: The Newcomers Manx Grand Prix race was won by Robert Dunlop from Scotland Steve Hislop in 2nd place and Ian Lougher in 3rd place at 100.62 mph.
[Reference]: The Newcomers Manx Grand Prix race was won by Robert Dunlop from Steve Hislop in 2nd place and Ian Lougher in 3rd place at an race speed of 100.62.
Figure 1: A motivating example to show the insights of our proposed approach when compared with several state-of-the-art methods.

1 Source code will be released at https://github.com/wentinghome/TAGQA.
For the task of TableQA, from our investigation, most current studies focus on factoid TableQA, in which the answer is a few words or a phrase copied directly from relevant table cells. In particular, current works on factoid TableQA mainly fall into two groups: (1) pipeline-based methods consisting of two stages, i.e., cell retrieval and answer reading (Zhu et al., 2021; Chen et al., 2020a); and (2) end-to-end neural networks, such as sequence-to-sequence models that take the question-answering context (e.g., question and table cells) as input to generate natural-language answers (Li et al., 2021b; Pan et al., 2022; Herzig et al., 2021; Pan et al., 2021; Chen, 2023).
Despite much progress on factoid TableQA, a gap exists between factoid TableQA and the demands of real scenarios. In factoid TableQA, answers are always short, consisting of a few words copied directly from the relevant table cells. In real-world scenarios, however, answers are expected to be long, informative, free-form sentences, motivating us to target free-form TableQA in this paper.
It is challenging to generate coherent and faithful free-form answers over tables. (1) The well-preserved spatial structure of tables is critical for retrieving the table cells relevant to the question. Different from factoid TableQA, free-form TableQA poses sophisticated questions that share less semantic similarity with the table content and depend more on the spatial structure of the table to infer multiple related cells; the related cells tend to be located in a relatively connected area, e.g., a few selected rows or columns. (2) The selected table cells, though containing the key points, are insufficient for composing entire coherent sentences. To generate fluent natural-language answers, external information, such as relevant background knowledge about the question, is necessary. (3) The model is expected to aggregate and reason over the question, the retrieved table cells, and the external knowledge to compose a reasonable answer. Given such heterogeneous information, a practical model should be capable of aggregating the information efficiently and generating a coherent and fluent free-form answer.
Figure 1 provides a motivating example to illustrate the insights of this paper. Given a table describing "the 1983 Manx Grand Prix Newcomers Junior Race Results" and the question "Who won in the first three places of The Newcomers Manx Grand Prix race?", the goal is to select relevant cells first and then generate a natural sentence as an answer. From this table, we observe that the state-of-the-art models TAPAS and MATE only select the "rider" column while missing the "rank" column, providing low cell-selection coverage. For overall generation quality, we observe that the end-to-end T5 (Raffel et al., 2020) as well as the pipeline-based TAPAS (Herzig et al., 2020) and MATE (Eisenschlos et al., 2021) miss key information from the table by mentioning only some of the three riders. In addition, TAPAS introduces a hallucinated rider named "Spaniard". These observations motivate us to design a model that selects the relevant cells more accurately and generates faithful answers grounded on the table given a question.
Based on the aforementioned insights, this paper designs a three-stage pipeline framework to tackle free-form TableQA. Although the high accuracy of end-to-end TableQA models is often ascribed to avoiding the error accumulation of multi-stage training, a long table distracts such models from focusing on relevant table cells, resulting in irrelevant answers. A cell selection module, on the other hand, provides a controllable and explainable perspective by extracting a small number of table cells as anchors for the model to generate answers. For the content selection stage, inspired by the recent success of graph models, we convert the table to a graph by designing node links and apply a Graph Neural Network (GNN) to aggregate node information and classify whether each table cell is relevant. In addition, to generate informative free-form answers, we employ a sparse retrieval technique to gather extra knowledge from Wikipedia. Both the extra knowledge and the relevant cells are then taken into account to calibrate the pre-trained language model's bias. Lastly, we adopt a fusion layer in the decoder to generate the final answer.
To summarize, the primary contributions of this paper are three-fold. (1) To the best of our knowledge, we are the first to convert a semi-structured table into a graph and then design a graph neural network to retrieve the relevant table cells. (2) External knowledge is leveraged to fill the gap between the selected table cells and the long informative answer by providing background information. (3) Comprehensive experiments on a public dataset named FeTaQA (Nan et al., 2022) are performed to verify the effectiveness of TAG-QA. Experimental results show that TAG-QA outperforms the strong baseline TAPAS by 17% and 14%, and outperforms the end-to-end T5 model by 16% and 12%, in terms of BLEU-4 and PARENT F-score, respectively.

TAG-QA Approach
In this section, we first formulate the problem of TableQA and then introduce the details of our proposed approach TAG-QA.

Problem Formulation
A free-form question-answering task is formulated as generating an answer a to a question q based on a semi-structured table T, which includes the table cell contents and table meta-information such as column and row headers. Different from factoid table question answering with short answers, free-form QA aims at generating long, informative answers.
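Stated compactly, with notation introduced here for exposition (this formula is ours, not one given in the paper):

```latex
\hat{a} \;=\; \operatorname*{arg\,max}_{a}\; P\big(a \mid q,\, T\big),
\qquad
T \;=\; \{c_{ij}\}_{i \le n,\, j \le m} \;\cup\; \{\text{row and column headers}\}
```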

Overview
Figure 2 illustrates the overall architecture of our proposed TAG-QA, which is composed of three stages, i.e., relevant table cell localization, external knowledge retrieval, and table-text fusion.

Relevant Table Cell Localization
The initial phase of TAG-QA involves table content selection, a pivotal step that serves as the foundation for subsequent stages. Notably, this stage is of utmost importance as it supplies essential input to the subsequent processes. FeTaQA is a formidable dataset, with a median/average percentage of relevant table cells of 10.7%/16.2%. To enhance the precision of the content selection stage, we design a table-to-graph converter that preserves the inherent spatial structure of the tables. We then employ a GNN to aggregate information at the cell level and perform a classification task over the table cells.

Table-to-Graph Converter

State-of-the-art models prefer to adopt pre-trained Language Models (LMs) to make predictions by transforming the semi-structured table into natural sentences using a pre-defined template. However, this loses the table structure information and deteriorates the performance of downstream tasks.
TAG-QA designs a table-to-graph converter to transform a table into a graph, preserving the table structure by identifying cell-to-cell relations. Figure 3 shows an example of transforming a table into a graph. For the i-th row, we add an empty row header rh_i that reflects the entire row's information. All table cells from the same row are fully connected, and all table cells from the same column are also fully connected. Accordingly, we design two types of relations for the table graph, i.e., the "of the same row" and "of the same column" relations. In particular, the "of the same row" relation captures entity information, while the "of the same column" relation reveals the connection among cells of the same attribute.
In addition, to incorporate the question into the graph, we create a question node and add a linking edge between the question node and each table cell with the relation "question to cell".
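To make the conversion concrete, the following is a minimal sketch of the converter described above; the function and relation names (same_row, same_col, question_to_cell) are illustrative, not the released implementation.

```python
# A minimal sketch of the table-to-graph conversion described above.
from itertools import combinations

def table_to_graph(table, question):
    """table: list of rows, each row a list of cell strings."""
    nodes, edges = [], []            # edges: (src, dst, relation)
    n_rows, n_cols = len(table), len(table[0])
    cell_id = lambda r, c: f"cell_{r}_{c}"

    # One node per cell; an empty row-header node rh_i per row.
    for r in range(n_rows):
        nodes.append((f"rh_{r}", ""))                 # empty row header
        for c in range(n_cols):
            nodes.append((cell_id(r, c), table[r][c]))

    # Fully connect cells within each row (row header included).
    for r in range(n_rows):
        row_nodes = [f"rh_{r}"] + [cell_id(r, c) for c in range(n_cols)]
        for u, v in combinations(row_nodes, 2):
            edges.append((u, v, "same_row"))

    # Fully connect cells within each column.
    for c in range(n_cols):
        col_nodes = [cell_id(r, c) for r in range(n_rows)]
        for u, v in combinations(col_nodes, 2):
            edges.append((u, v, "same_col"))

    # Link the question node to every cell.
    nodes.append(("question", question))
    for r in range(n_rows):
        for c in range(n_cols):
            edges.append(("question", cell_id(r, c), "question_to_cell"))
    return nodes, edges
```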
TAG-QA Content Selection Inspired by QA-GNN (Yasunaga et al., 2021), we propose a content selection module (TAG-CS) that retrieves relevant table cells from the table-based graph. TAG-CS takes the converted table graph from Sec. 2.3 as input and outputs the question-related table cells. TAG-CS reasons at the table cell level, with each graph node representing a table cell. To fully exploit the table's semantic and spatial information, TAG-CS acquires the initial graph node embeddings through a pre-trained LM, e.g., BERT. The pre-trained LM and the GNN are jointly trained to predict the selected cells.

GNN Architecture
We use a Graph Attention Network (GAT) (Veličković et al., 2017), which leverages masked self-attention layers and iterative message passing among neighbors, to predict the selected graph nodes. GAT follows Eq. 1 to update the feature h_t^l ∈ R^D of node t at layer l by gathering the attention-weighted messages from its neighbors N_t:

$$h_t^{l+1} = f_g\Big(\sum_{s \in \mathcal{N}_t} \alpha_{st}\, m_{st}\Big) + h_t^{l} \quad (1)$$

where α_st and m_st ∈ R^D are the self-attention weight and the message passed from source node s to target node t, respectively, and f_g is a 2-layer Multi-Layer Perceptron (MLP) with batch normalization. The message m_st from node v_s to v_t is computed via Eq. 2:

$$m_{st} = f_m\big(\,[\,h_s^{l} \,\|\, u_s \,\|\, r_{st}\,]\,\big) \quad (2)$$

where u_s ∈ R^{T/2} is the source node feature linearly transformed from the one-hot node-type vector, r_st ∈ R^T is the relation feature from source node s to target node t, computed through a 2-layer MLP that takes the relation type and the source and target node types into account, and f_m is a linear transformation.

The self-attention coefficient α_st is updated via Eq. 3, in which the query and key vectors are linear transformations g_q and g_k of the node features, the edge features, and the previous layer's hidden states:

$$q_s = g_q\big([\,h_s^{l} \,\|\, u_s\,]\big), \quad k_t = g_k\big([\,h_t^{l} \,\|\, u_t \,\|\, r_{st}\,]\big), \quad \alpha_{st} = \operatorname{softmax}_{s \in \mathcal{N}_t}\!\Big(\frac{q_s^{\top} k_t}{\sqrt{D}}\Big) \quad (3)$$
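For concreteness, below is a toy dense-matrix rendering of one such layer under the reconstructed Eqs. 1-3. The module sizes, the concatenation inside f_m, g_q, and g_k, and other details are our assumptions rather than the authors' implementation, and the adjacency mask is assumed to include self-loops so every target has at least one source.

```python
# A toy dense implementation of the update in Eqs. 1-3 (illustrative only).
import math
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, d=200):
        super().__init__()
        self.f_m = nn.Linear(3 * d, d)                        # message, Eq. 2
        self.g_q = nn.Linear(2 * d, d)                        # query, Eq. 3
        self.g_k = nn.Linear(3 * d, d)                        # key, Eq. 3
        self.f_g = nn.Sequential(nn.Linear(d, d), nn.BatchNorm1d(d),
                                 nn.ReLU(), nn.Linear(d, d))  # 2-layer MLP, Eq. 1
        self.d = d

    def forward(self, h, u, r, adj):
        # h: (N, d) node states; u: (N, d) node-type features;
        # r: (N, N, d) relation features; adj: (N, N) 0/1 mask with self-loops.
        N = h.size(0)
        h_s = h.unsqueeze(1).expand(N, N, self.d)             # h_s[s, t] = h[s]
        u_s = u.unsqueeze(1).expand(N, N, self.d)
        m = self.f_m(torch.cat([h_s, u_s, r], dim=-1))        # messages m_st
        q = self.g_q(torch.cat([h, u], dim=-1))               # q_s per node
        k = self.g_k(torch.cat([h.unsqueeze(0).expand(N, N, self.d),
                                u.unsqueeze(0).expand(N, N, self.d), r], dim=-1))
        scores = (q.unsqueeze(1) * k).sum(-1) / math.sqrt(self.d)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=0)                  # normalize over sources
        agg = (alpha.unsqueeze(-1) * m).sum(dim=0)            # sum_s alpha_st * m_st
        return self.f_g(agg) + h                              # residual update, Eq. 1
```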
GNN Training and Inference Given a question q and a table T, TAG-CS reasons over a graph containing both the table cell nodes and the question node, making predictions at the row and column level. We observe that relevant table cells tend to appear in a relatively connected area; we therefore make predictions over row and column headers and choose the cells in their intersection. Compared to predicting at the cell level, which results in low recall, this method has a higher chance of capturing the relevant table cells. During training, TAG-CS minimizes the cross-entropy loss for predicting the rows and columns that contain relevant cells.
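As a sketch of the inference step, using the top-3 rows and columns reported in Sec. 3.2 (function and variable names are ours):

```python
# Illustrative decoding of relevant cells from row/column header scores:
# take the top-k rows and top-k columns and keep the cells in their intersection.
import torch

def select_cells(row_logits, col_logits, k_rows=3, k_cols=3):
    """row_logits: (n_rows,), col_logits: (n_cols,) scores from the GNN."""
    top_rows = torch.topk(row_logits, k_rows).indices.tolist()
    top_cols = torch.topk(col_logits, k_cols).indices.tolist()
    return [(r, c) for r in top_rows for c in top_cols]
```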

External Knowledge Retrieval
TAG-QA is the first attempt to leverage external knowledge for the table-based free-form QA task. TAG-QA adopts an effective and simple sparse retrieval based on the TF-IDF approach to select potentially relevant context from Wikipedia.
Sparse Retrieval For TAG-QA, the external knowledge serves as complementary background context for the subsequent table-text fusion stage. We choose sparse retrieval with BM25 (Robertson and Zaragoza, 2009) as the ranking function to retrieve the most relevant text as supplementary information. Given a query q with m keywords k_1, k_2, ..., k_m, the BM25 ranking score p_i for document d_i is calculated by Eq. 6:

$$p_i = \sum_{j=1}^{m} \operatorname{idf}(k_j)\,\frac{tf(k_j, d_i)\,(\kappa_1 + 1)}{tf(k_j, d_i) + \kappa_1\Big(1 - b + b\,\frac{|d_i|}{L_D}\Big)} \quad (6)$$

where idf is the Inverse Document Frequency (IDF), tf(k_j, d_i) is the term frequency of keyword k_j in document d_i, |d_i| is the length of document d_i, L_D is the average document length, and κ_1 and b are the standard BM25 hyperparameters.
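For illustration, a self-contained scorer implementing Eq. 6; the whitespace tokenization and the κ_1/b defaults below are our choices, not values from the paper.

```python
# A self-contained BM25 scorer matching Eq. 6 (defaults are common values).
import math
from collections import Counter

def bm25_scores(query_keywords, docs, kappa1=0.9, b=0.4):
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / N          # L_D
    df = Counter()                                        # document frequencies
    for toks in tokenized:
        for w in set(toks):
            df[w] += 1
    scores = []
    for toks in tokenized:
        tf, L, s = Counter(toks), len(toks), 0.0
        for k in (w.lower() for w in query_keywords):
            idf = math.log(1 + (N - df[k] + 0.5) / (df[k] + 0.5))
            s += idf * tf[k] * (kappa1 + 1) / (
                tf[k] + kappa1 * (1 - b + b * L / avg_len))
        scores.append(s)
    return scores
```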

Table-Text Fusion
After obtaining the predicted highlighted table cells and the supporting context from Wikipedia, TAG-QA aggregates and combines the two information sources through a sequence-to-sequence Fusion-in-Decoder (FiD) model (Izacard and Grave, 2021). FiD appends the question to each information source and encodes each component independently. It then merges all source features and passes them to the decoder.
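The following is a minimal sketch of this FiD-style fusion on top of a Hugging Face T5 backbone; the official FiD code instead wraps the reshaping inside the model's forward pass, so treat this as an approximation, with illustrative formatting strings.

```python
# Sketch: encode each (question, source) pair independently, then let the
# decoder cross-attend over the concatenated encoder states.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def fid_generate(question, sources, max_new_tokens=64):
    # One encoder pass per information source (selected cells, retrieved text).
    inputs = [f"question: {question} context: {s}" for s in sources]
    enc = tok(inputs, return_tensors="pt", padding=True, truncation=True)
    states = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state
    # Fuse: flatten the per-source states into one long sequence (1, n*L, d).
    fused = states.reshape(1, -1, states.size(-1))
    mask = enc.attention_mask.reshape(1, -1)
    out = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=fused),
        attention_mask=mask, max_new_tokens=max_new_tokens, num_beams=3)
    return tok.decode(out[0], skip_special_tokens=True)
```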

Experiments and Analysis
In this section, we explore the following experimental questions: (1) Does the proposed TAG-QA generate more coherent and faithful answers than the baselines? (2) Are table cell selection, knowledge retrieval, and fusion necessary for free-form TableQA? (3) Is it promising to keep enhancing the three modules of TAG-QA?

Dataset
This paper focuses on tackling the challenge of generating long free-form answers rather than short factoid responses. Consequently, we opt for the state-of-the-art dataset FeTaQA (Nan et al., 2022) as our testbed.
The training dataset comprises 7,327 instances, while the development and test sets encompass 1,002 and 2,004 examples, respectively.

Implementation Details
TAG-CS TAG-CS applies the BERT checkpoint "bert-base-uncased" to learn the table cell representations. For the BERT model, we set the learning rate to 1e-6 and impose a maximum token length of 35 for each cell. The acquired cell-level embeddings then serve as input node features for our GNN. Within the TAG-CS framework, the GNN module comprises 3 layers, each with 200-dimensional node features, and we apply a dropout rate of 0.2 to each layer for regularization. We train our model on the FeTaQA dataset for a maximum of 50 epochs, employing the RAdam optimizer (Liu et al., 2019) with a weight decay of 0.01 on a Titan-RTX GPU with 24G memory. To limit GPU memory usage, we cap the number of table cells at 200 and set the batch size to 1. The best checkpoint is selected based on performance on the development set and then used for decoding the test set. Additionally, drawing upon our accumulated experience in this context, TAG-CS selects the intersection cells of the top 3 rows and top 3 columns as the relevant cells.

Sparse Retrieval Our implementation relies on the PyTorch-based toolkit Pyserini, designed for reproducible information retrieval research with both sparse and dense representations. We use the question as the query to retrieve pertinent contextual information from Wikipedia, selecting the first sentence from the top results. We specifically employ the Lucene index "enwiki-paragraphs".2

FiD For FiD, TAG-QA employs the Adam optimizer with a learning rate of 1e-5, and we select the best checkpoint for inference. In the inference phase, we use beam search with a beam size of 3 and apply a length penalty of 1 when generating answers.
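A minimal retrieval call might look as follows; import paths vary across Pyserini versions, and this sketch assumes a recent release with the prebuilt index named above.

```python
# Illustrative Pyserini usage for the sparse-retrieval stage.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("enwiki-paragraphs")
hits = searcher.search("Who won in the first three places of "
                       "The Newcomers Manx Grand Prix race?", k=5)
for h in hits:
    print(h.docid, h.score)
    print(searcher.doc(h.docid).raw()[:200])  # first part of the paragraph
```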

Baselines
To validate the effectiveness of TAG-QA, we choose two different types of methods as baselines, including end-to-end and pipeline-based models.

Table 2: Results on the FeTaQA dataset. "P/R/F" denotes precision/recall/F-score. We report the end-to-end models UniLM, BART, and T5, as well as the pipeline results. The results of the table cell selection strategies TAPAS, MATE, and our proposed TAG, each with T5 as the backbone generation model, are denoted TAPAS-T5, MATE-T5, and TAGQA-T5. To validate the effectiveness of the framework components, we test different combinations of source information, where "Q" is the question, "Retrieve" is the retrieved external knowledge, "fullTab" is the full table, and "predCell" is the selected table cells. The last row, TAGQA-FiD, is the proposed method.
Firstly, we compare TAG-QA with strong state-of-the-art end-to-end pre-trained generative LMs: UniLM (Dong et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). As input to the end-to-end models, we flatten the table by concatenating the cells with the special token [SEP] in between, and prepend the question to form a natural sequence, e.g., "question [SEP] flattened table" (see the sketch below). Furthermore, we compare our proposed model with pipeline-based methods comprising two stages, content selection and answer generation, where content selection predicts the relevant cells. We choose two table-based pre-training models for content selection: TAPAS (Herzig et al., 2020) and MATE (Eisenschlos et al., 2021). Moreover, T5 is chosen as the baselines' answer-generation backbone due to its capacity to integrate the table cells and the retrieved knowledge.
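A minimal sketch of this flattening (helper name is ours):

```python
# Flatten a table for the end-to-end baselines: cells joined with [SEP],
# prefixed by the question.
def flatten_for_baseline(question, table):
    cells = [cell for row in table for cell in row]
    return question + " [SEP] " + " [SEP] ".join(cells)
```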

Automatic Evaluation Metrics
We use various automatic metrics to evaluate model performance. Owing to the pipeline style of TAG-QA, we report two sets of metrics, for the content selection and answer generation stages respectively. Firstly, to evaluate the retrieval competency of the table semantic parser, we report Precision, Recall, and F1 scores. To evaluate answer generation quality, we choose several automatic evaluation metrics, i.e., BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which measure the n-gram match between the generated sentence and the reference answer. Since these metrics fail to reflect the faithfulness of an answer to the facts in the table, we also report the PARENT (Dhingra et al., 2019) and PARENT-T (Wang et al., 2020) scores. The PARENT score takes into account the answer's match with both the reference answer and the table content, while PARENT-T focuses on the overlap between the generated answer and the corresponding table.
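For reference, the n-gram metrics can be computed with common toolkits such as sacrebleu and rouge-score; the library choices and example strings here are ours, not necessarily those used in the paper.

```python
# Example computation of BLEU-4 and ROUGE-L for one prediction/reference pair.
import sacrebleu
from rouge_score import rouge_scorer

pred = "Robert Dunlop won, with Steve Hislop 2nd and Ian Lougher 3rd."
ref = ("The race was won by Robert Dunlop, Steve Hislop in 2nd place "
       "and Ian Lougher in 3rd.")

bleu = sacrebleu.sentence_bleu(pred, [ref])          # BLEU-4 by default
rl = rouge_scorer.RougeScorer(["rougeL"]).score(ref, pred)["rougeL"].fmeasure
print(f"BLEU-4: {bleu.score:.2f}  ROUGE-L: {rl:.4f}")
```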

Results
We first evaluate the table semantic parsing results of the TAG-CS content selection stage, as shown in Table 1. On the F-1 score, TAG-QA outperforms the strong baselines TAPAS and MATE by 9.9% and 13.27%, respectively. On recall, TAG-QA achieves the best result, demonstrating that it retrieves more of the relevant table cells. On precision, the baseline models outperform TAG-QA by retrieving fewer cells, a larger fraction of which are relevant. However, precision and recall are a trade-off here, and the relevant cells have the stronger impact on overall answer generation quality; we can therefore tolerate a small number of irrelevant cells in order to retain as many correct cells as possible.
In addition, Table 2 reports the quality of the answers generated by TAG-QA compared to previous end-to-end and pipeline-based state-of-the-art models. On the overlap-based metrics BLEU-4, METEOR, and ROUGE-L, TAG-QA outperforms all end-to-end and pipeline-based models. Specifically, TAG-QA gains 14.27%/1.86%/9.93% over the best end-to-end model, UniLM, in the "Q-fullTab" setting, and 14.76%/8.98%/13.88% over the best pipeline-based model, TAPAS, in the "Q-predCell" setting. On the faithfulness metrics PARENT and PARENT-T, TAG-QA performs best among the pipeline models, outperforming TAPAS in the "Q-predCell" setting by 13.92% and 3.88% on PARENT and PARENT-T, respectively. Compared with the end-to-end models, TAG-QA gives the best PARENT score, while UniLM shows the best PARENT-T. This is explainable: TAG-QA incorporates information from outside the table to generate answers, achieving a trade-off between staying grounded on the table and synthesizing informative answers.
Furthermore, to answer Question 2, "Are the three stages of the framework necessary to generate high-quality answers?", we conduct an experiment in Table 2 comparing the T5 model with "Q-fullTab" against the pipeline methods backed by T5 with "Q-predCell". The results show that our proposed content selection module, TAGQA-T5, outperforms T5 with the full table while selecting only 7% of the table cells. This indicates that table cell selection is necessary, since the relevant cells provide an anchor for generating high-quality answers. Moreover, to investigate the effect of retrieved knowledge, we show results in Table 2 obtained by concatenating "Retrieve" to the input. The retrieved knowledge enhances model performance by providing background knowledge, and the proposed TAGQA-T5 achieves the best result by integrating the retrieved text and the informative selected cells. Lastly, our fusion module further enhances overall performance by aggregating table and text efficiently.
Last but not least, to answer the question "Is there space to further enhance performance using this framework?", we conduct an oracle experiment, shown as "Oracle-T5". With the simple retrieval technique, a T5 generation backbone, and oracle table cells, the BLEU-4 result reaches 31%, and PARENT and PARENT-T exceed 30%. With a better retrieval and fusion model, performance could be boosted further.

Analysis
To further evaluate the quality of the answers generated by various state-of-the-art models against the ground-truth answers, we perform an additional human evaluation. Besides, we conduct an ablation study to validate the three building blocks of TAG-QA: the joint training of the LM and GNN in TAG-CS, the external context retrieved from Wikipedia, and the FiD model. Furthermore, we present a case study showing the different answer qualities produced by the various models.
Human Evaluation Following Nan et al. (2022), we recruit three human annotators who have passed the College English Test (CET-6)3 to judge the quality of the generated sentences. We randomly draw 100 samples from the test examples in the FeTaQA dataset and collect answers from TAG-QA and the baseline models. We then present the generated answers to the three annotators without revealing the models' names, thus reducing human variance.
We instruct the human raters to evaluate sentence quality from four aspects: faithfulness, fluency, correctness, and adequacy. For each aspect, an annotator assigns a score ranging from 1 (worst) to 5 (best) based on the answer quality; the "overall" column refers to the average ranking of the model. First, for fluency, the annotator checks whether an answer is natural and grammatical. Second, for correctness, we compare the answer with the ground truth to check whether the predicted answer contains the correct information. Third, adequacy reflects whether an answer covers all the aspects that are asked about. Finally, faithfulness evaluates whether an answer is faithful and grounded in the contents of the highlighted table region, covering all the relevant information from the table while not including key information from outside the table. From Table 3, we can see that TAG-QA ranks top among all models.
Ablation Study To determine which building blocks drive the improvements, we examine ablated models for each component of TAG-QA, including the joint training of BERT and the GNN in TAG-CS, the sparse retrieval, and FiD. Table 4 presents the ablation results under different evaluation metrics. Model performance drops when any component is removed. In particular, ablating the sparse retrieval module causes the largest drop in BLEU-4 and PARENT scores, while removing FiD causes the most significant drop in PARENT-T.
Case Study To inspect the effect of TAG-QA directly, we present a case study in Figure 4, providing a sampled table, the question, the ground-truth relevant table cells (highlighted in blue), the predicted answers of the models, and the reference. First, we find that the end-to-end models generally include more information than the pipeline models due to the more abundant table input, but they suffer from hallucination; for example, T5 and BART identify the ranking position of "Leandro de Oliveira" as "17th" whereas the table says "73rd". Second, the pipeline models tend to generate irrelevant information; e.g., MATE mentions the duration and points instead of answering with the ranking position and the event. Third, both the end-to-end and pipeline models fail to cover all the relevant information from the table; e.g., UniLM does not capture the 12 km event, and TAPAS fails to mention the 73rd position. By contrast, TAG-QA provides the highest table coverage while keeping the sentences fluent.

Related Work
In this section, we review work related to ours from the perspectives of TableQA, GNNs for natural language processing, and knowledge-grounded text generation.
TableQA FeTaQA is the first TableQA dataset that addresses the significance of free-form answer generation, while most current research, including WikiTableQuestions (Pasupat and Liang, 2015), Spider (Yu et al., 2018), HybridQA (Chen et al., 2020b), OTT-QA (Chen et al., 2020a), and TAT-QA (Zhu et al., 2021), focuses on short factoid answer generation. Early solutions (Zhong et al., 2017; Liang et al., 2017) to TableQA parse the natural question into a machine-executable meaning representation that can be used to query the table. To reduce labor-intensive logical-form annotation, semantic parsers trained with weak supervision from denotations have been drawing attention. Many Transformer-based table pre-training models demonstrate decent TableQA performance, e.g., TaPas (Herzig et al., 2020), MATE (Eisenschlos et al., 2021), TaBERT (Yin et al., 2020), StruG (Deng et al., 2021), GraPPa (Yu et al., 2021), and TaPEx (Liu et al., 2022a). In addition, rather than exploring the table structure, RCI (Glass et al., 2021) assumes the rows and columns are independent and predicts, for each row and column of a table individually, the probability that it contains the answer to a question.
GNN for Natural Language Processing Apart from the widely renowned causal language models that have shown impressive results on various tasks (Vaswani et al., 2017; Parmar et al., 2018; Wang et al., 2023a, 2022, 2023b), a rich variety of language processing tasks gain improvements from exploiting the power of GNNs (Li et al., 2015). Tasks such as semantic parsing (Chen et al., 2021a), text classification (Lin et al., 2021), text generation (Fei et al., 2021), and question answering (Wang et al., 2021; Yasunaga et al., 2021) can be expressed with a graph structure and handled with graph-based methods. In addition, researchers apply GNNs to model text generation from structured data, e.g., graph-to-sequence (Marcheggiani and Perez-Beltrachini, 2018) and AMR-to-text (Ribeiro et al., 2019).
Knowledge-Grounded Text Generation Encoder-decoder models have been proposed to tackle generation tasks by mapping an input to an output sequence. However, the input text alone is often insufficient to supply the knowledge needed to generate decent output, lacking commonsense, factual events, and semantic information. Knowledge-grounded text generation, which incorporates external knowledge such as linguistic features (Liu et al., 2021c), knowledge graphs (Liu et al., 2021b; Li et al., 2021a), knowledge bases (Eric and Manning, 2017; He et al., 2017; Liu et al., 2022b), and textual knowledge (Liu et al., 2021a; Zhao et al., 2021), helps to generate more logical and informative answers.

Conclusion
This paper presents a generalized pipeline-based framework, TAG-QA, for free-form long answer generation in TableQA. The core idea of TAG-QA is to divide the answer generation process into three stages: (1) transform the table into a graph and jointly reason over the question-table graph to select relevant cells; (2) retrieve contextual knowledge from Wikipedia using sparse retrieval; and (3) integrate the selected cells with the contextual knowledge to predict the final answer. Extensive experiments on the public dataset FeTaQA verify the quality of the generated answers in terms of both fluency and faithfulness.

Limitations
One limitation of TAG-CS, which accepts the entire table as input, arises when dealing with large tables, as training both BERT and the graph model simultaneously becomes challenging due to GPU memory constraints. Consequently, one promising avenue for future research is the efficient modeling of large tables. Furthermore, it is worth noting that the availability of only one public dataset for free-form TableQA, FeTaQA, has constrained our validation to this single dataset. However, we are committed to expanding the scope of our research in the future by evaluating the performance of our pipeline model, TAG-QA, across multiple free-form TableQA datasets.
Figure 2: An overview of TAG-QA. The input to TAG-QA is the combination of one table and a question, while the output is an answer. The top box shows the content selection process, which first converts the table to a graph and selects relevant nodes using a GNN. The middle box shows the sparse retrieval process (Sec. 2.3), which retrieves relevant text as complementary information. The rightmost blue box integrates the selected cells and retrieved texts to generate the final answer.
[Q]: Who won in the first three places of The Newcomers Manx Grand Prix race?
[Retrieved text]: Scotland's Steve Hislop finished second with 101.27 mph and Wales' Ian Lougher finished third with 100.62 mph.
[A]: The Newcomers Manx Grand Prix race was won by Robert Dunlop from Steve Hislop in 2nd place and Ian Lougher in 3rd place at an race speed of 100.62.

Table 1: Content selection results on the FeTaQA dataset.
Model / Precision / Recall / F-1
TAPAS (Herzig et al., 2020): 65.31 / 24.20 / 35.32
MATE (Eisenschlos et al., 2021): 56…

Table 4: Ablation study of the proposed model. We examine the ablated models by removing the Joint Training (JT) of TAG-CS, the Sparse Retrieval (SR), and FiD.
Leandro de Oliveira represented Brazil at the 2011 World Cross Country Championships and placed 73rd in the 12 km race.
[REF]: Leandro de Oliveira represented Brazil at the 2011 World Cross Country Championships and placed 73rd in the 12 km race.
Figure 4: A case study from FeTaQA for qualitative analysis. The highlighted cells are the ground-truth relevant table cells. "RB" refers to "Representing Brazil". Hallucinated content in the predicted answers is marked in red and correct content in blue.