S3HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering

Answering multi-hop questions over hybrid factual knowledge from given text and tables (TextTableQA) is a challenging task. Existing models mainly adopt a retriever-reader framework, which has several deficiencies, such as noisy labeling for retriever training, insufficient utilization of heterogeneous information over text and tables, and deficient ability for different reasoning operations. In this paper, we propose a three-stage TextTableQA framework, S3HQA, which comprises a retriever, a selector, and a reasoner. We use a retriever with refinement training to solve the noisy labeling problem. Then, a hybrid selector considers the linked relationships between heterogeneous data to select the most relevant factual knowledge. For the final stage, instead of adapting a reading comprehension module as in previous methods, we employ a generation-based reasoner to obtain answers. This includes two approaches: a row-wise generator and an LLM prompting generator (used for the first time in this task). The experimental results demonstrate that our method achieves competitive results in the few-shot setting. When trained on the full dataset, our approach outperforms all baseline methods, ranking first on the HybridQA leaderboard.


Introduction
Question answering systems aim to answer various questions with evidence located in structured knowledge bases (e.g., tables) (Pasupat and Liang, 2015; Yu et al., 2018) or unstructured texts (Rajpurkar et al., 2016). Considering that many questions in real-world applications need to jointly utilize multiple sources of knowledge, the hybrid form of question answering over texts and tables (TextTableQA) has been proposed and has attracted more and more attention (Chen et al., 2020b,a; Zhu et al., 2021; Chen et al., 2021; Zhao et al., 2022; Wang et al., 2022a). Fact reasoning (Chen et al., 2020a,b) is a critical question type of TextTableQA. It requires jointly using multiple pieces of evidence from tables and texts to reason out answers with different operations, such as correlation (e.g., multi-hop) and aggregation (e.g., comparison). Hyperlinks between table cells and linked passages are essential resources for establishing their relationship and supporting retrieval and reasoning for multi-hop questions. As shown in Figure 1, answering the complex question Q1 requires jointly reasoning from textual evidence (P1) to table evidence ([R2, Place]) and then to other table evidence ([R2, Athlete]).
Existing methods consist of two main stages: retrieval and reading (Chen et al., 2020b; Feng et al., 2022). The retriever filters out the cells and passages with high relevance to the question, and then the reader extracts a span from the retrieval results as the final answer. However, current two-stage methods still have three limitations, as follows.
1) Noisy labeling for training the retriever. Existing retrieval methods usually ignore the weakly supervised nature of the answer annotation (Chen et al., 2020b; Wang et al., 2022b; Feng et al., 2022). For Q2 in Figure 1, only the final answer "1960" is given; the specific location of the hybrid evidence is unknown. Therefore, much pseudo-true evidence is labeled automatically by string matching (marked in green), which introduces considerable evidence noise.
2) Insufficient utilization of heterogeneous information. After retrieval, existing methods select a particular cell or passage for reading to extract the final answer (Chen et al., 2020b; Wang et al., 2022b). For Q1 in Figure 1, previous models were more likely to choose P1 or the coordinates [R2, Place] to extract the answer. However, these methods seldom used the hybrid information of table schema and cell-passage hyperlinks, which is the key factor in answering multi-hop questions.
3) Deficient ability for different reasoning operations. Previous methods (Eisenschlos et al., 2021; Kumar et al., 2021; Wang et al., 2022b) mainly used an extraction module to obtain answers, which cannot support knowledge reasoning that requires comparison, calculation, and other operations.
In this paper, we propose a three-stage approach, S3HQA, to solve the above problems. (1) Retriever with refinement training: we propose a two-step training method that splits the training data into two parts, so that the noise in the retrieval phase can be alleviated. (2) Hybrid selector: we propose a selection algorithm that chooses supporting facts of different granularity and from different resources depending on the question type; by considering the hybrid data of tables and text, it can effectively utilize the heterogeneous information of tables and passages. (3) Generation-based reasoner: we utilize a generation-based model to address different question types. The model allows better aggregation of information on the input side, which not only provides better multi-hop reasoning capabilities but also handles comparison and counting questions. Furthermore, we are the first to use the LLM in-context learning approach for table-text hybrid question answering.
We evaluate our proposed model on the challenging TextTableQA benchmark HybridQA. The empirical results show that our approach outperforms all existing models.

Task Definition

Given a natural language question Q and a table T whose number of headers is N, where some cells have a linked passage P_ij, our goal is to generate the answer A with model Θ. A is a span from a table cell or a linked passage, or a derived result for counting questions.
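To make the setting concrete, a hybrid instance can be sketched as a table whose cells optionally carry a linked passage; the field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cell:
    text: str                       # surface string shown in the table
    passage: Optional[str] = None   # linked Wikipedia passage, if any

@dataclass
class HybridTable:
    headers: List[str]              # the N column headers
    rows: List[List[Cell]]          # one Cell per header in each row

# A toy instance in the spirit of Figure 1: the Place cell links to a passage.
table = HybridTable(
    headers=["Athlete", "Place"],
    rows=[[Cell("Philip Mulkey"),
           Cell("Memphis", passage="Memphis is a city on the Mississippi River.")]],
)

# Multi-hop evidence: hop from a linked passage back to cells in the same row.
linked = [c.passage for row in table.rows for c in row if c.passage]
```

Answering a bridge question then amounts to matching the question against the linked passages and returning a cell span from the matching row.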

Retriever with Refinement Training
The retriever performs initial filtering of heterogeneous resources. However, accurately labeling the location of answers incurs high annotation costs. In TextTableQA data, the answer A usually appears in multiple locations, which makes it difficult to generate precise retrieval labels. We therefore use a two-step training method, with a row-based retriever and a passage-based retriever at each step.
Inspired by Kumar et al. (2021), retrieval training has two steps. First, we divide the data D into two folds according to the string-matching labels G_i. Specifically, an instance whose answer A appears exactly once forms D_1, and an instance whose answer A appears multiple times forms D_2. Taking Figure 1 as an example, Q1 and Q3 belong to D_1 while Q2 belongs to D_2. In the first step, we use only D_1, whose labels are noiseless, to train a model Θ_1. In the second step, we use the trained weights Θ_1 to train the model Θ_2. For an input x, the loss function is

L(x) = - Σ_{z ∈ R} q(z) log p_{Θ_2}(z | x),

where q(z) = p_{Θ_1}(z | x, z ∈ R) is the probability distribution given by the first-step model restricted to the candidate rows R containing the answer span, taken here as a constant with zero gradients (Eisenschlos et al., 2021).
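The two-fold split and the second-step objective can be sketched as follows; `p_theta2` and `q_theta1` stand for the row distributions of Θ_2 and Θ_1, and the dictionary-based interface is an illustrative simplification of the actual model:

```python
import math

def split_folds(instances):
    """D1: the answer string occurs in exactly one row (noiseless label);
    D2: it occurs in several rows (ambiguous, noisy label)."""
    d1, d2 = [], []
    for inst in instances:
        hits = [i for i, row in enumerate(inst["rows"]) if inst["answer"] in row]
        (d1 if len(hits) == 1 else d2).append(inst)
    return d1, d2

def refinement_loss(p_theta2, q_theta1, candidate_rows):
    """Second-step loss: cross-entropy of Theta_2 against q(z), the Theta_1
    distribution renormalized over the candidate rows R; q is treated as a
    constant target (no gradient flows through it)."""
    z = sum(q_theta1[r] for r in candidate_rows)
    q = {r: q_theta1[r] / z for r in candidate_rows}
    return -sum(q[r] * math.log(p_theta2[r]) for r in candidate_rows)
```

With a uniform Θ_1 target over two candidate rows and a uniform Θ_2, the loss equals log 2, the entropy of the target distribution.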
Meanwhile, we use a passage-based retriever to enhance the performance of the row-based retriever (PassageFilter). Specifically, we use the passage-based retriever to obtain a prediction score for passage relevance. Based on this score, we reorder the input of the row-based retriever. This mitigates the limitation on input sequence length imposed by the pre-trained model.
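A minimal sketch of this reordering, assuming each candidate passage already carries a score from the passage-based retriever (the names and the whitespace token count are simplifications):

```python
def passage_filter(linked_passages, passage_scores, max_tokens):
    """Keep the highest-scoring linked passages first, so that truncation to
    the encoder's length budget discards the least relevant ones."""
    ranked = sorted(linked_passages,
                    key=lambda p: passage_scores.get(p, 0.0), reverse=True)
    kept, used = [], 0
    for passage in ranked:
        n = len(passage.split())  # crude stand-in for subword token count
        if used + n > max_tokens:
            break
        kept.append(passage)
        used += n
    return kept
```

Without the reordering, a relevant passage near the end of the serialized row would simply be cut off by the length limit.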

Hybrid Selector
This module combines the results of the two granularities of retrievers. For this task, we consider the question type and the relationships between the table and linked passages essential. As shown in Figure 2, the hybrid selector chooses the appropriate data source from the two retrieval results depending on the question type.
Specifically, for general bridge multi-hop questions, we use a single row and its linked passages. For comparison/count questions, we consider multiple rows and further filter the related sentences, deleting linked passages with low scores. This not only enables the generation module to obtain accurate information but also prevents the introduction of a large amount of unrelated information. The selector outputs a mixed sequence with high relevance based on the relationship between the question, the table, and the passages. The procedure is shown in Algorithm 1.
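This selection policy can be sketched as follows (Algorithm 1 itself is not reproduced here; the question-type labels, threshold, and field names are assumptions for illustration):

```python
def hybrid_select(question_type, ranked_rows, passage_scores,
                  top_n=3, min_score=0.5):
    """Bridge questions: a single top row plus all of its linked passages.
    Comparison/count questions: several rows, with weakly related linked
    passages pruned by their passage-retriever score."""
    if question_type == "bridge":
        row = ranked_rows[0]
        return [row], list(row["passages"])
    rows = ranked_rows[:top_n]
    kept = [p for row in rows for p in row["passages"]
            if passage_scores.get(p, 0.0) >= min_score]
    return rows, kept
```

The output pair (rows, passages) is then serialized into the mixed evidence sequence fed to the reasoner.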

Generation-based Reasoner
The results of the selector take both granularities into account. Unlike previous approaches, which were based on a span extraction module, we use a generation-based model for answer prediction.

Row-wise generator
To generate an accurate answer string A = (a_1, a_2, ..., a_n) given the question Q and selection evidence S, we perform lexical analysis to identify the question type, such as counting or comparison, by looking for certain keywords or comparative adjectives. We utilize two special tags, <Count> and <Compare>, which indicate the question type. We then use the results of the passage retriever to rank the passages in order of their relevance, eliminating the impact of model input length limitations. Finally, we train a Seq2Seq language model with parameters Θ, using the input sequences Q and S and the previous outputs a_{<i}, to optimize the product of the probabilities of the output sequence a_1, a_2, ..., a_n:

p(A | Q, S) = ∏_{i=1}^{n} p_Θ(a_i | Q, S, a_{<i}).
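How the generator input might be serialized with type tags is sketched below; the tag strings, keyword lists, and separator are illustrative, not the exact ones used by S3HQA:

```python
def build_generator_input(question, evidence_rows, passages):
    """Prefix the question with a type tag chosen by simple lexical cues,
    then append the selected table rows and their linked passages."""
    q = question.lower()
    if any(w in q for w in ("how many", "number of")):
        tag = "<Count>"
    elif any(w in q for w in ("higher", "lower", "earlier", "later", "more", "fewer")):
        tag = "<Compare>"
    else:
        tag = "<Bridge>"
    parts = [tag + " " + question]
    parts += [" ; ".join(row) for row in evidence_rows]   # flatten table rows
    parts += passages                                     # ranked passages last
    return " </s> ".join(parts)
```

The resulting sequence is what the Seq2Seq model conditions on when generating a_1, ..., a_n.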

LLM prompting generator
With the emergence of large language models, In-Context Learning (Dong et al., 2022) and Chain-of-Thought prompting (Wei et al., 2022) have become two particularly popular research topics in this field.
In this paper, we introduce a prompting strategy for multi-hop TextTableQA.
We utilize the selection evidence S and apply LLM-based prompting. We conduct experiments on both vanilla prompting and chain-of-thought prompting in zero-shot and few-shot scenarios.
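A sketch of how such a few-shot chain-of-thought prompt can be assembled from the selection evidence; the demonstration wording is invented for illustration:

```python
def build_cot_prompt(demos, table_text, passage_text, question):
    """Few-shot CoT prompting: each demonstration ends with a written-out
    reasoning chain and answer; the test instance ends with the cue only."""
    blocks = []
    for d in demos:
        blocks.append(
            f"Table: {d['table']}\nText: {d['text']}\n"
            f"Question: {d['question']}\n"
            f"Answer: Let's think step by step. {d['cot']} "
            f"So the answer is {d['answer']}."
        )
    blocks.append(
        f"Table: {table_text}\nText: {passage_text}\n"
        f"Question: {question}\nAnswer: Let's think step by step."
    )
    return "\n\n".join(blocks)
```

The zero-shot variant simply drops the demonstrations; the "Direct" variant drops the reasoning chains and keeps only question-answer pairs.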

Experiment Setup
Datasets. We conduct experiments on HybridQA (Chen et al., 2020b). Detailed statistics are shown in Appendix A. For evaluation, we follow the official protocol and report exact match (EM) accuracy and F1 score.

Implementation details. The implementation details are shown in Appendix B. The experimental results are averaged over five runs.

Fully-supervised Results
Table 1 shows the comparison between our model and previous typical approaches on both the development and test sets. Our proposed S3HQA works significantly better than the baselines in terms of EM and F1 on HybridQA. The results indicate that S3HQA is an effective model for multi-hop question answering over tabular and textual data. Specifically, it can effectively handle multi-hop reasoning and make full use of heterogeneous information.
However, our approach is outperformed by the DEHG model (Feng et al., 2022) in terms of F1 score on the dev set. We speculate that this might be because the DEHG approach uses its own Open Information Extraction (OIE) tool.

LLM-prompting Results
We present our zero-shot and few-shot results in Table 2. "Direct" refers to a simple prompting method where only the question, context, and answer are provided to the model without any additional reasoning process. In contrast, "CoT" involves a human-authored chain-of-thought reasoning process that prompts the model in a more structured and logical way. The experiments demonstrate that in-context learning with large language models can achieve promising results. Specifically, utilizing the chain-of-thought prompting method significantly enhances the model's performance.
However, it is worth noting that there is still a performance gap compared to fine-tuning the model on the full dataset (Table 1). Fine-tuning allows the model to learn more task-specific information about TextTableQA, resulting in better performance. Nevertheless, our results show that the LLM-prompting method can be a useful alternative to fine-tuning, especially when only a limited amount of labeled data is available.

Ablation Studies
We conduct ablation studies on the test set. We validate the effects of three modules: the retriever with refinement training, the hybrid selector, and the generation-based reasoner. The retriever performs initial filtering of heterogeneous resources; the selector, combined with hyperlinks, further identifies the exact evidence needed to answer multi-hop questions; and the reasoner uses the selection evidence to obtain the final answer.

Effect of the proposed retriever. As shown in Table 3, with the BERT-base-uncased model, the retriever with refinement training achieves 87.2. When we use DeBERTa-base, the top-1 retrieval performance improves by 0.8%. For w/o refinement training, where we use the entire data directly for training, the top-1 recall drops about 3.2%. For w/o PassageFilter, where we remove that mechanism, the top-1 recall drops about 3.2%. For Vanilla-Retriever, where we use the row-based retriever (Kumar et al., 2021) and remove all our mechanisms, the top-1 score drops about 5.3%. This shows that our model handles the weakly supervised data noise problem well.

Effect of hybrid selector. As shown in Table 4, we removed the selector of S3HQA and replaced it with the previous cell-based selector (Wang et al., 2022b), which directly uses the top-1 result of the row retriever as input to the generator. The w/o hybrid selector row shows that EM drops 2.9% and F1 drops 1.6%, which proves the effectiveness of our selector.
Effect of reasoner. As shown in Table 4, we design two baselines. The BERT-large reader (Chen et al., 2020b; Wang et al., 2022b) uses BERT (Devlin et al., 2018) as the encoder and solves this task by predicting start/end tokens. w/o special tags deletes the special tags. Both experiments demonstrate that our S3HQA reasoner performs best on the HybridQA task.

Related Work
The TextTableQA task (Wang et al., 2022a) has attracted more and more attention. For multi-hop datasets, previous work used a pipeline approach (Chen et al., 2020b), an unsupervised approach (Pan et al., 2021), multi-granularity (Wang et al., 2022b), table pre-trained language models (Eisenschlos et al., 2021), multi-instance learning (Kumar et al., 2021), and graph neural networks (Feng et al., 2022) to solve this task. For numerical reasoning, which is quite different from the multi-hop setting, there is also a lot of work (Zhu et al., 2021; Zhao et al., 2022; Zhou et al., 2022; Lei et al., 2022; Li et al., 2022; Wei et al., 2023) addressing these question types. Unlike these methods, our proposed three-stage model S3HQA can alleviate noise from weak supervision and solve different types of multi-hop TextTableQA questions by handling the relationship between tables and text.

Conclusion
This paper proposes a three-stage model consisting of retriever, selector, and reasoner, which can effectively address multi-hop TextTableQA.The proposed method solves three drawbacks of the previous methods: noisy labeling for training retriever, insufficient utilization of heterogeneous information, and deficient ability for reasoning.It achieves new state-of-the-art performance on the widely used benchmark HybridQA.In future work, we will design more interpretable TextTableQA models to predict the explicit reasoning path.

Limitations
Since multi-hop TextTableQA currently has only one dataset, HybridQA, our model is evaluated on a single dataset, which may limit the generalizability of our findings. Transparency and interpretability are also important in multi-hop question answering. While our model achieves the best results, it does not fully predict the reasoning path explicitly and can only predict the row-level and passage-level paths. In future work, we will design more interpretable TextTableQA models.

A HybridQA Dataset
HybridQA is a large-scale, complex, and multi-hop TextTableQA benchmark.

B Implementation Details
B.1 Fully-supervised Setting

We utilize PyTorch (Paszke et al., 2019) to implement our proposed model. During pre-processing, the questions, tables, and passages are tokenized and lemmatized with the NLTK (Bird, 2006) toolkit. We conducted the experiments on a single NVIDIA GeForce RTX 3090.
In the retriever stage, we use BERT-base-uncased (Devlin et al., 2018) and DeBERTa-base (He et al., 2020) to obtain the initial representations. For the first step, the batch size is 1, the number of epochs is 5, and the learning rate is 7e-6 (selected from 1e-5, 7e-6, 5e-6). Training takes around 10 hours. For the second step, we use a smaller learning rate of 2e-6 (selected from 5e-6, 3e-6, 2e-6) and 5 epochs. Training takes around 8 hours. In the selector stage, the target row count N_S is 3. In the generator stage, we use the BART-large language model (Lewis et al., 2020); the learning rate is 1e-5 (selected from 5e-5, 1e-5, 5e-6), the batch size is 8, the number of epochs is 10, the beam size is 3, and the maximum generation length is 20.

B.2 LLM-prompting Setting
We use the OpenAI GPT-3.5 (text-davinci-003) API model with temperature = 0 in our experiments. For the few-shot setting, we use 2 shots. To elicit the LLM's capability to perform multi-hop reasoning, we use the text "Read the following table and text information, answer a question. Let's think step by step." as our prompt.
Figure 1: Examples from HybridQA. Q1: Who is the athlete in a city located on the Mississippi River? (A1: Philip Mulkey); Q2: In which year did Walnut-born athletes participate in the Rome Olympics? (A2: 1960); Q3 (comparison): Who is the higher-scoring athlete from the cities of Eugene and Walnut? (A3: Rafer Johnson).
Figure 2: An overview of the S3HQA framework. The retrieval stage is divided into two steps. The hybrid selector considers the linked relationships between heterogeneous data to select the most relevant factual knowledge.

Table 1 :
Performance of our model and related work on the HybridQA dataset.

Table 2 :
Performance Comparison of LLM-Prompting Method on Zero-Shot and Few-Shot Scenarios for Hy-bridQA Dataset.
Tables and texts are crawled from Wikipedia. Each row in a table describes several attributes of an instance. Each table has hyperlinked Wikipedia passages that describe details of the attributes. The dataset contains 62,682 instances in the train set, 3,466 in the dev set, and 3,463 in the test set.

Table 5 :
Data split: In-Table means the answer comes from plain text in the table, and In-Passage means the answer comes from a certain passage.