MuGER²: Multi-Granularity Evidence Retrieval and Reasoning for Hybrid Question Answering



Introduction
Traditional knowledge-based question answering systems derive answers to questions based on homogeneous knowledge, such as knowledge graphs (Bao et al., 2016; Gu et al., 2021), tables (Jauhar et al., 2016; Zhang et al., 2020), or passages (Zhao et al., 2021; Zhou et al., 2022a), and achieve remarkable performance. However, they neglect a more general scenario that requires reasoning over heterogeneous data to answer a question. To investigate this challenge, Chen et al. (2020) propose hybrid question answering (HQA) and publish a corresponding dataset, HybridQA. Specifically, each question of HybridQA is aligned with a table and with passages linked to table cells (abbreviated as links) as knowledge. The answer to the question may come from either table cells or links. Figure 1 illustrates two examples of HybridQA. For the first one, the answer 'Philip Mulkey' comes from a cell, while for the second one, the answer 'Mississippi River' is derived from a link.
Existing HQA models usually consist of two components: a retriever to learn evidence and a reasoner to leverage the evidence to derive the answer. Intuitively, the heterogeneous data shown in Figure 1 can provide evidence of different granularity to the HQA models, such as coarse-grained columns and rows and fine-grained cells and links. Previous models always retrieve coarse- or fine-grained evidence and directly use a span-based reading comprehension model to reason about the answer. For example, Kumar et al. (2021) choose a coarse-grained region as the evidence, e.g., a table row. On the contrary, Chen et al. (2020) and Sun et al. (2021) focus on the fine-grained units, table cells and links. Intuitively, compared with fine-grained evidence, coarse-grained evidence is easier to retrieve accurately. However, it contributes less to the reasoner's performance since it contains more noisy information. The fine-grained evidence is just the opposite for the retriever and the reasoner.
To prove this argument, we conduct experiments on the development set of HybridQA to analyze the retriever and reasoner performance with different granularity evidence. The results of evidence retrieval and answer reasoning are evaluated with the retrieval recall (R@1) and the F1 score, respectively. As shown in Figure 2, the coarse-grained evidence achieves better retrieval recall but lower F1 scores than the fine-grained evidence. As the granularity changes from coarse to fine, the F1 score increases while the R@1 declines. The results are consistent with our hypothesis.
To preserve the advantages and eliminate the disadvantages of different granularity evidence, we propose MuGER², a Multi-Granularity Evidence Retrieval and Reasoning approach for HQA. In the retrieval stage, a unified retriever is designed to learn multi-granularity evidence from the heterogeneous data, involving columns, rows, cells, and links. Compared with existing methods, the multi-granularity evidence preserves the retrieval recall and provides more aspects of information for answer reasoning. In the reasoning stage, to avoid the redundant information in the multi-granularity evidence lowering the answer reader's performance, an evidence selector (E-SEL) is designed to navigate the fine-grained evidence for the reader by fusing the learned multi-granularity evidence. Together, the multi-granularity evidence retrieval and reasoning designs of MuGER² preserve both the retrieval recall and the F1 score and further boost the end-to-end HQA performance.
We conduct extensive experiments on the HybridQA dataset to verify the effectiveness of the proposed MuGER². Experimental results show that MuGER² achieves remarkable improvements, outperforming a publicly available strong baseline by 10.0% EM and 12.9% F1 on the end-to-end HQA performance. Ablation studies verify the effectiveness of the proposed multi-granularity evidence retrieval and the evidence selector. Moreover, comparisons with different mono-granularity baselines prove the effectiveness of multi-granularity evidence for HQA.
Our contributions are as follows: (1) We analyze the limitation of HQA systems that rely on evidence of only one granularity in both the retriever and the reasoner, and propose a multi-granularity evidence retrieval and reasoning approach that boosts the end-to-end HQA results. (2) We propose a joint retrieval method that improves the evidence retrieval performance, and an evidence selector that accurately navigates the fine-grained evidence from the multi-granularity information to preserve the reader's performance. (3) We conduct extensive experiments on the HybridQA dataset and show the effectiveness of MuGER².

Task: Hybrid Question Answering
Hybrid question answering (HQA) aims to tackle answer reasoning over heterogeneous data, as shown in Figure 1. The input of this task includes a question Q, a table T, and a set of links L. Specifically, the table T consists of cells {c_{i,j}} with i ∈ [1, M] and j ∈ [1, N], where M and N are the numbers of rows and columns. Each column has a header h_j describing the field of the cells {c_{i,j}}_{i=1}^{M}. A table cell c_{i,j} may link to a subset of links L_{i,j} ⊆ L. Given the input, an HQA system aims to return the answer text A of the question Q by reasoning over the heterogeneous information. In this work, In-Table and In-Passage respectively denote that the answer is in a cell or in a link.

Model Overview
As described above, coarse-grained evidence is easier for the retriever to derive but contributes less to the reasoner, while fine-grained evidence is just the opposite. To preserve their advantages and get rid of their disadvantages, we propose MuGER², a multi-granularity evidence retrieval and reasoning approach. As shown in Figure 3, in the retrieval stage, MuGER² adopts a unified retriever to learn the multi-granularity evidence. In the reasoning stage, an evidence selector is designed to navigate the fine-grained evidence for the reader based on the multi-granularity evidence. The details of these modules are introduced in the following sections.

Stage-1: Multi-Granularity Evidence Retrieval
In this section, we first define the multi-granularity evidence and then introduce the retrieval model of MuGER².

Multi-Granularity Evidence
According to the characteristics of HQA, we define the multi-granularity evidence E, which involves four kinds of evidence with different granularity: E = {E_col, E_row, E_cell, E_link}.
Specifically, E_col indicates the column that includes the answer; E_row represents the row that contains the answer; E_cell denotes the cell whose value contains the answer; and E_link indicates the link containing the answer.

Evidence Retriever
To integrate the above-mentioned four kinds of evidence into our method, we adopt their joint probabilities to model the evidence retriever. Formally, given a question Q and the multi-granularity evidence candidate E derived from T and L, the evidence retriever is formulated as:

p_ret(E | Q, T, L) = ∏_t p_e(E_t | Q, T, L),

where t ∈ {col, row, cell, link}, E_t denotes the t-kind element of E, and p_e(E_t | Q, T, L) represents the probability that E_t is the evidence. To compute p_e(E_t | Q, T, L), we further introduce a retrieval score s_t for E_t, which weighs the possibility that E_t contains the answer, and adopt it as:

p_e(E_t | Q, T, L) = s_t / Z_t,

where Z_t is a normalization factor equal to the sum of the retrieval scores of all t-kind evidence candidates.
Given that Z_t is identical across candidates of the same kind, we omit it in our implementation and use only the retrieval score for evidence selection.
To compute all retrieval scores for the multi-granularity evidence E, a straightforward method is to train four separate models that each predict one granularity of evidence. However, this method neglects the relations among different granularity evidence. To tackle this challenge, we adopt a unified framework for all kinds of evidence and train it jointly.
Specifically, we first utilize BERT (Devlin et al., 2019) as the encoder to learn the representation of each evidence candidate. Formally, given evidence E_t, we concatenate t, Q, and E_t with "[SEP]" into a sequence and prepend "[CLS]" at its start. Subsequently, we feed the sequence into BERT. Finally, we apply max-pooling over all tokens of the BERT output to obtain the sequence representation h_t. This process is denoted as:

h_t = MaxPooling(BERT([CLS] t [SEP] Q [SEP] E_t)),

where t is inserted into the input sequence to indicate the kind of the evidence. After obtaining the representation of E_t, we feed it into a linear projection followed by the sigmoid function to compute the retrieval score:

s_t = Sigmoid(W^T h_t),

where W ∈ R^K is a K-dimensional trainable parameter.
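The encode-and-score step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode` stands in for BERT with max-pooling (here using deterministic hash-based token vectors instead of real BERT outputs), and `W` is an untrained weight vector.

```python
import zlib
import numpy as np

K = 8  # hidden size (768 for BERT-base; tiny here for illustration)
rng = np.random.default_rng(0)
W = rng.normal(size=K)  # trainable projection; s_t = sigmoid(W^T h_t)

def encode(kind: str, question: str, evidence: str) -> np.ndarray:
    """Stand-in for the BERT encoder: builds "[CLS] t [SEP] Q [SEP] E_t"
    and max-pools per-token vectors (deterministic hash embeddings,
    NOT real BERT outputs)."""
    tokens = f"[CLS] {kind} [SEP] {question} [SEP] {evidence}".split()
    vecs = np.stack([
        np.random.default_rng(zlib.crc32(tok.encode())).normal(size=K)
        for tok in tokens
    ])
    return vecs.max(axis=0)  # max-pooling over tokens -> h_t

def retrieval_score(kind: str, question: str, evidence: str) -> float:
    """s_t = sigmoid(W^T h_t), the score that this candidate holds the answer."""
    h_t = encode(kind, question, evidence)
    return float(1.0 / (1.0 + np.exp(-(W @ h_t))))

s_cell = retrieval_score("cell", "Who held the record?", "Winner [SEP] Philip Mulkey")
assert 0.0 < s_cell < 1.0  # a valid probability-like score
```

In practice the same encoder and projection are shared across all four evidence kinds, with the kind token t telling the model which granularity it is scoring.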

Content of Different Evidence
E_t refers to different content depending on the kind t, as described below. Notably, the granularity of a piece of evidence is determined by what kind of candidates are ranked, e.g., cells or rows, NOT by what content a candidate uses.
Column Column evidence aims to take the meaning of a specific column header in the table into account. The column header usually describes the field of the answer. Therefore, we utilize the table header as the column evidence, namely E_col = h_j.
Link Link evidence indicates a passage that a table cell links to.It is another form of knowledge in the HQA task and we assign each E link candidate a specific passage.
Cell We formalize cell evidence E_cell as a concatenation, with the "[SEP]" separator, of its header h_j, the cell value c_{i,j}, its neighbors in the same row {h_{j'}, c_{i,j'}}_{j'=1}^{N}, and its links L_{i,j}. In cell encoding, the neighbor information refers to the values of the other cells in the same row without their linked passages. The neighbor information and the corresponding header of a cell serve as its positional information (row, column), which is essential to distinguish cells, especially those with the same value but in different positions. This process is also applied to the modeling of row evidence.
Row Traditional methods usually concatenate all the row cells into one input sequence to score a table row. However, in the HQA task, a table row may link to several passages, and directly concatenating all these contents results in overlong input sequences. Therefore, we score a row based on its cells: we first encode all its cells separately and then max-aggregate the cell scores to obtain the row score. Namely, E_row consists of a set of E_cell.
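As a sketch, the cell-evidence sequence and the max-aggregated row score described above might be assembled as follows (the example table and the exact field order are illustrative assumptions, not taken from the paper):

```python
def cell_evidence(headers, row_cells, j, links):
    """Build E_cell for cell (i, j): its header, its value, its row neighbors
    (header : value pairs, without their linked passages), and its own links,
    joined with the "[SEP]" separator."""
    neighbors = " ; ".join(f"{h} : {v}" for h, v in zip(headers, row_cells))
    return " [SEP] ".join([headers[j], row_cells[j], neighbors] + links)

def row_score(cell_scores):
    """A row is scored by max-aggregating the scores of its cells."""
    return max(cell_scores)

headers = ["Year", "Winner", "Venue"]
row = ["1961", "Philip Mulkey", "Chicago"]
ev = cell_evidence(headers, row, 1, ["Philip Mulkey is an American athlete ..."])
assert ev.startswith("Winner [SEP] Philip Mulkey")
assert row_score([0.2, 0.9, 0.4]) == 0.9
```

The max-aggregation keeps row scoring cheap while still reflecting the most answer-relevant cell, which is why a row candidate never needs its passages concatenated into one overlong sequence.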

Training
Joint Training for Multi-Granularity As mentioned above, we only adopt the retrieval score for evidence selection. Therefore, the core of training the evidence retriever is to learn a model that produces the retrieval score. Since the retrieval score s_t reflects the probability that the evidence E_t contains the answer, we treat retrieval score prediction as a binary classification and use the binary cross-entropy (BCE) loss as the training objective:

Loss_t^bce = -[y_t log s_t + (1 - y_t) log(1 - s_t)],

where y_t ∈ {0, 1} is the label of E_t indicating whether the evidence E_t contains the answer. Note that, since only the answer texts are annotated in the HybridQA dataset, we obtain the evidence labels by distant supervision like Chen et al. (2020), which basically follows the principle that positive evidence contains the answer. Furthermore, to consider the relative information among different granularity evidence, we leverage a unified retriever to model all of them and learn it in a joint way:

Loss_bce = Σ_t Loss_t^bce,  t ∈ {col, row, cell, link}.
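The joint BCE objective can be sketched as follows: the per-candidate losses of all four granularities are summed so that a single shared retriever learns from every kind of evidence (the scores and labels below are made up for illustration):

```python
import math

def bce(score, label):
    """Binary cross-entropy for one evidence candidate with retrieval
    score `score` and distant-supervision label `label` in {0, 1}."""
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))

def joint_bce_loss(batch):
    """Sum the BCE terms over all four granularities, so one unified
    retriever is updated jointly. `batch` maps each granularity to
    (score, label) pairs, e.g. produced by one shared encoder."""
    return sum(bce(s, y) for pairs in batch.values() for s, y in pairs)

batch = {
    "col":  [(0.9, 1), (0.2, 0)],
    "row":  [(0.7, 1)],
    "cell": [(0.6, 1), (0.1, 0)],
    "link": [(0.3, 0)],
}
loss = joint_bce_loss(batch)
assert loss > 0
```

Because every granularity flows through the same parameters, a gradient step on, say, a row candidate also shapes how cells and links are scored, which is the "relative information" the joint training is meant to capture.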

Contrastive Learning within Each Granularity
The above joint training method is capable of capturing the relation among different granularity evidence. However, it neglects the relations among candidate evidence within the same granularity, because it depends on the BCE loss, which regards each candidate as an individual instance and does not involve any other instances besides the given evidence. To alleviate this problem, we leverage contrastive learning (CL) (Hadsell et al., 2006) within candidate evidence of the same granularity to model their relative similarities and further enhance the model's representation and scoring capability.
To construct the training objective, we first define a function sim(·) to denote the similarity between two evidence representations:

sim(h_1, h_2) = h_1^T h_2 / (||h_1|| · ||h_2||).

It also requires building positive and negative instances for a given evidence candidate. For the positive instance, following Gao et al. (2021), we set E_t^+ = E_t. Since dropout masks are placed on the fully-connected layers as well as the attention probabilities during the training of BERT, feeding the same input to the encoder twice yields two different representations h_t and h_t^+. To obtain negative samples E_t^- for the given evidence, like most existing work on contrastive learning (Gao et al., 2021), we regard the other evidence in the same batch as negatives, whose representations are denoted as h_t^-. After that, we take a cross-entropy objective with in-batch negatives. Formally, given a batch, we split it into four groups, each corresponding to one granularity. Let D denote all instances in a group; the training objective is:

Loss_cl^t = -Σ_{i∈D} log [ exp(sim(h_i, h_i^+)/τ) / Σ_{k∈D} exp(sim(h_i, h_k^+)/τ) ],    (8)

where τ is a temperature hyperparameter. In practice, we compute four losses as Eq. (8) based on the four groups and learn them jointly:

Loss_cl = Σ_t Loss_cl^t.

Finally, we combine Loss_bce and Loss_cl to produce the final objective function:

Loss = Loss_bce + Loss_cl.
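Eq. (8) can be sketched numerically as follows. The random vectors stand in for the two dropout-perturbed BERT encodings of each candidate in one granularity group, and `tau` plays the role of the temperature in Eq. (8):

```python
import numpy as np

def info_nce_loss(h, h_pos, tau=0.05):
    """Contrastive objective of Eq. (8) for ONE granularity group.

    h[i] and h_pos[i] are two dropout-perturbed encodings of the same
    candidate (the SimCSE-style positive pair); every other h_pos[k] in
    the group serves as an in-batch negative."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    n = len(h)
    loss = 0.0
    for i in range(n):
        logits = np.array([cos(h[i], h_pos[k]) / tau for k in range(n)])
        # cross-entropy with the i-th entry as the gold class
        loss += -logits[i] + np.log(np.exp(logits).sum())
    return loss / n

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))                   # stand-in encodings
h_pos = h + 0.01 * rng.normal(size=(4, 16))    # mimic a light dropout perturbation
loss = info_nce_loss(h, h_pos)
assert loss >= 0.0
```

Minimizing this loss pulls each pair (h_i, h_i^+) together while pushing h_i away from the other candidates' representations, which is exactly the within-granularity structure the BCE loss alone cannot see.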

Stage-2: Answer Reasoning with the Multi-Granularity Evidence
As shown in Figure 3, the answer reasoning module p_rea(A|E) is responsible for deriving the answer based on the output of the evidence retriever. This module consists of two essential components, the evidence selector and the reader. Algorithm 1 shows the whole workflow of this module.

Evidence Selector
As analyzed above, multi-granularity evidence is easier for the retriever to score accurately. However, it contains much redundant information and burdens the reader. To alleviate this issue, the evidence selector (E-SEL) is proposed to navigate the fine-grained evidence for the reader based on the multi-granularity evidence. Specifically, we define two scores s_tab^{i,j} and s_pass^{i,j} to respectively indicate the probabilities that the answer is in the cell c_{i,j} and in the links L_{i,j}. When calculating the scores, we take the multi-granularity evidence into account. Suppose that the link set L_{i,j} has X links denoted as {l_x, x ∈ [1, X]}. The s_tab^{i,j} and s_pass^{i,j} are computed by aggregating the retrieval scores s_col^j, s_row^i, s_cell^{i,j}, and s_link^x. After that, we obtain two global maximum scores s_tab and s_pass:

s_tab = max_{i,j} s_tab^{i,j},    s_pass = max_{i,j} s_pass^{i,j},

and determine the answer type as:

Type = In-Table, if s_tab > s_pass;  In-Passage, if s_tab ≤ s_pass,    (13)

where In-Table denotes that the answer to the question is in the cell, while In-Passage indicates that the answer is in the link.
After obtaining the answer type, we navigate the cell or link as the final fine-grained evidence to derive the answer. Formally, if the type is In-Table, the cell with score s_tab is selected and its value is returned as the answer. Otherwise, the link with score s_pass is selected and fed to the reader to extract the answer.
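The selector logic can be sketched as below. The product used here to combine the four retrieval scores is an assumption made for illustration; the global maxima and the comparison of s_tab against s_pass follow Eq. (13):

```python
from math import prod

def select_evidence(s_col, s_row, s_cell, s_link):
    """Evidence selector (E-SEL) sketch.

    s_col[j], s_row[i], s_cell[(i, j)] are retrieval scores; s_link[(i, j)]
    is the list of link scores for cell (i, j). Combining the granularity
    scores by product is an ASSUMPTION for illustration."""
    s_tab = {}   # (i, j)    -> score that the answer is the cell value
    s_pass = {}  # (i, j, x) -> score that the answer is in link l_x
    for (i, j), sc in s_cell.items():
        s_tab[(i, j)] = prod([s_col[j], s_row[i], sc])
        for x, sl in enumerate(s_link.get((i, j), [])):
            s_pass[(i, j, x)] = prod([s_col[j], s_row[i], sl])
    best_tab = max(s_tab, key=s_tab.get)
    best_pass = max(s_pass, key=s_pass.get) if s_pass else None
    # Eq. (13): In-Table if s_tab > s_pass, otherwise In-Passage
    if best_pass is None or s_tab[best_tab] > s_pass[best_pass]:
        return "In-Table", best_tab
    return "In-Passage", best_pass

s_col = {0: 0.9, 1: 0.4}
s_row = {0: 0.3, 1: 0.8}
s_cell = {(0, 0): 0.2, (1, 0): 0.5}
s_link = {(1, 0): [0.9]}  # cell (1, 0) links to one passage
answer_type, picked = select_evidence(s_col, s_row, s_cell, s_link)
assert answer_type == "In-Passage" and picked == (1, 0, 0)
```

Here the strong column, row, and link scores around cell (1, 0) push s_pass above s_tab, so the selector routes that single link (rather than the whole table region) to the reader.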

Reader
For the reader RC(Q, l), following existing works (Chen et al., 2020; Sun et al., 2021), we implement it with a span-based reading comprehension model that adopts BERT as its backbone.

Dataset
To verify our proposed model, we conduct experiments on HybridQA (Chen et al., 2020), a dataset of multi-hop question answering over tabular and textual data. The basic statistics of HybridQA are listed in Table 2. In the data split, 'In-Table' means the answer is a table cell value, 'In-Passage' means the answer exists in a linked passage, and 'Compute' means the answer is computed by performing numerical operations. In this paper, we mainly focus on the first two types.


Settings
Two variants of BERT (Devlin et al., 2019) are utilized in MuGER², namely MuGER²-base and MuGER²-large. The learning rate is set to 5e-5 and the training batch size to 30. A batch is divided into 4 groups corresponding to the 4 kinds of evidence, and each group contains 6 instances.

Baselines
MQA-QG (Pan et al., 2021) is an unsupervised framework that generates multi-hop questions from the hybrid evidence and uses the generated questions to train the QA model. Table-Only (Chen et al., 2020) relies only on the tabular information, parsing the question into a symbolic form and executing it to find the answer. Passage-Only (Chen et al., 2020) uses only the hyperlinked passages, retrieving related passages to perform the RC process. Hybrider (Chen et al., 2020) solves HybridQA with a two-stage pipeline framework that retrieves a table cell and extracts the answer from its value or linked passages. Dochopper (Sun et al., 2021) builds a retrieval index for each cell and passage sentence and performs RC on the retrieved evidence.

Results and Analysis
We use EM and F1 as the evaluation metrics to compare the performance of our MuGER² with that of the baselines. As shown in Table 1, MuGER²-base outperforms the strong baselines on the HybridQA task and achieves significant improvements on both the development and test sets. Specifically, it even exceeds Hybrider-large, whose parameter scale is larger. This indicates that our proposed multi-granularity evidence is beneficial for this task. In addition, compared with the base model, MuGER²-large obtains further gains.

Stage-1: Retrieval
Effect of Contrastive Learning Without the contrastive learning (CL) within each granularity of evidence, the unified retriever is trained only by the binary cross-entropy objective in the joint way. The results in Table 3 demonstrate that CL brings 0.8% EM and 1.1% F1 improvements to our model. These improvements come from the ability of CL to enhance evidence representation learning by pushing the positive and negative evidence representations apart in the semantic space.

Effect of Joint Training
In our method, we adopt joint training (JT) to train the unified retriever to capture the relative information among different granularities. To prove the effectiveness of this design, we compare it with a straightforward alternative that trains four separate retrievers for the different granularity evidence. For fairness, we turn off contrastive learning and use only the binary cross-entropy objective to train both the four separate retrievers and the unified retriever. The results in Table 3 show that JT brings 1.6% EM and 1.3% F1 improvements to the HQA performance.
Besides the end-to-end HQA performance, we further report the retrieval recall of the four different granularity evidence and of the complete multi-granularity evidence in Figure 4. The retrieval recall (R@1) indicates whether the retrieved top-1 candidate of the corresponding evidence contains the answer. The results show that JT improves the retrieval performance on all these retrieval tasks, which proves that sharing knowledge across different granularities benefits evidence retrieval.

Stage-2: Reasoning
Effect of Evidence Selector To show the effectiveness of the evidence selector (E-SEL), we compare it with a straightforward answer reasoning method without any fine-grained evidence navigation, which simply flattens the retrieved top-1 multi-granularity evidence and feeds the sequence into the RC process. Results in Table 3 show that the evidence selector improves the results by 7.8% EM and 7.9% F1 over this method. This is due to its ability to accurately navigate the cell or link based on the multi-granularity evidence, which reduces the burden on the RC model. In contrast, the straightforward method brings redundant and noisy information, which leads to weak RC performance.

Baselines
To compare the HQA performance based on different granularity evidence, we construct baselines for the four kinds of evidence: column, row, cell, and link. Specifically, the baselines of column, row, and link granularity first retrieve the top-1 evidence of the selected granularity and then flatten the retrieved evidence into a sequence to derive the answer with the RC model, while for the cell granularity, the value of the retrieved top-1 cell is directly returned as the answer. Note that all the retrievers and RC models of these baselines are initialized with the pre-trained BERT-base model.

Results and Analysis
The end-to-end HQA results based on different granularity evidence are shown in Table 4 and Figure 2. The coarse-grained Column evidence achieves the best retrieval recall, because question domains are easily mapped to a certain table header. However, reasoning the answer within a column is difficult because cells in a column are nearly indistinguishable in the absence of other semantic information, which leads to the lowest HQA performance. In contrast, the fine-grained Cell and Link evidence reduces the burden of answer reasoning but lowers the retrieval recall due to the huge number of candidates and the limited semantic information. Moreover, adopting only cell or only link evidence cannot solve all questions, which further lowers the HQA performance. The Row granularity achieves the best performance among these baselines since it balances the retrieval recall and RC performance to a certain extent. Compared with these baselines, our multi-granularity evidence brings remarkable improvements to the end-to-end HQA performance by preserving the evidence retrieval recall and the RC performance at the same time.

Case Study
Through statistics on the development set of HybridQA, we find 135 questions whose answers are recalled only by our multi-granularity evidence but missed by both the coarse- and fine-grained evidence. An example of such a question is given in Figure 5. The table along with the linked passages on the left is the given knowledge for answering the question. The answer 'Facebook Watch' is contained in link1 of cell(2,0), where (2,0) denotes (row2, column0).
The table on the right lists the different granularity evidence retrieved. The methods based on row, cell, and link granularity all retrieve incorrect evidence, namely <row4, cell(4,0), link2>. The reason is that the wrong evidence has high semantic similarity with the question, which yields a higher ranking score. Although the column-granularity method correctly retrieves the evidence <column0>, the reading comprehension model fails to derive the answer text accurately from the column. In contrast, our method retrieves the evidence <column, row, cell, link>. The evidence is correctly retrieved except for the cell, since the question is an in-passage question. The result demonstrates that the multi-granularity evidence improves the retrieval recall, benefiting from learning the interactive information. The evidence selector further helps to navigate the fine-grained cell or link based on the multi-granularity evidence.

Error Analysis
To further analyze our method, we also collect statistics on the error cases in the predictions of MuGER²-large, based on the development set of HybridQA. As shown in Figure 6, these cases fall into three categories according to their causes. The first category is caused by evidence retrieval errors, i.e., the retrieved evidence overlooks the answer. Since multi-granularity evidence has an advantage in improving the coverage of answers, the proportion of this category is very low, only 3%. The second category is caused by the evidence selector and accounts for about 40% of the errors. There are two reasons for these selector errors. First, although multi-granularity evidence improves the coverage, it also introduces noise, which may lead to mis-navigation of the fine-grained cell or link; this accounts for about 33%. Second, even questions that are offered correct cells or links may be answered incorrectly due to a wrong judgment of the answer type, accounting for about 7%. The last category is caused by the reading comprehension module: limited by the performance of the RC model, even when fed the correct fine-grained evidence, it may still produce a wrong answer. According to our analysis, this category has the largest proportion, about 57% of all error cases.

Discussion
More recently, two novel approaches, MATE (Eisenschlos et al., 2021) and MITQA (Kumar et al., 2021), were proposed and achieve significant results, 70.0 and 71.9 F1 respectively on the HybridQA test set. Specifically, MATE mainly focuses on strengthening the ability to encode large tables with multi-view attention, while MITQA designs a multi-instance training method based on distant supervision to filter the noise from multiple answer spans. In contrast, our MuGER² puts more effort into integrating multi-granularity evidence to enhance the reasoning ability for the HQA problem. These two approaches and our model address different perspectives of HQA, and it is promising to combine them to achieve better performance.

Related Work
Recently, many researchers have attempted to tackle the question answering task using heterogeneous knowledge (Sun et al., 2019; Sawant et al., 2019; Zayats et al., 2021; Li et al., 2021; Xiong et al., 2019; Zhou et al., 2022b). However, traditional studies only combine the information from heterogeneous knowledge and fail to answer questions that require reasoning over heterogeneous data. To fill this gap, the hybrid question answering (HQA) task together with the HybridQA dataset is proposed by Chen et al. (2020). HybridQA provides a WiKiTable (Pasupat and Liang, 2015) along with its hyperlinked Wikipedia passages (Rajpurkar et al., 2016) as evidence for each question. Based on HybridQA, Chen et al. (2021a) further propose the OTT-QA dataset, which requires the system to retrieve relevant tables and text for the given questions. Zhu et al. (2021) and Chen et al. (2021b) propose TAT-QA and FinQA, which require numerical reasoning over heterogeneous data.
Existing works on HybridQA usually retrieve mono-granularity evidence from the heterogeneous data to derive the answer. Hybrider (Chen et al., 2020) proposes a two-phase pipeline framework that retrieves a table cell as the evidence and feeds its value and linked passages into an RC model to extract the answer. Dochopper (Sun et al., 2021) proposes an end-to-end multi-hop retrieval method to directly retrieve a passage sentence or a cell value as the evidence. In addition, Pan et al. (2021) explore an unsupervised multi-hop QA model, MQA-QG, which can generate human-like multi-hop training data from heterogeneous data resources.

Conclusion
We propose MuGER², a multi-granularity evidence retrieval and reasoning approach for hybrid question answering (HQA). In our method, a unified retriever is designed to learn multi-granularity evidence, and an evidence selector is proposed to navigate the fine-grained evidence for the reader. Experimental results show that MuGER² boosts the end-to-end HQA performance and outperforms the strong baselines on the HybridQA benchmark.

Limitations
In this paper, we focus on the hybrid question answering task, in which the answers to most questions can be extracted from cell values or linked passages using a reading comprehension model. Although our MuGER² performs well on this task, one limitation is that it cannot answer questions requiring numerical operations, such as counting and comparison. To enable our model to answer more complex questions, we will develop its numerical reasoning capabilities in future work.

Figure 1: Two example questions of the HQA task.
Figure 2: The retriever and reasoner performance with different granularity evidence. Note that the F1 score in this figure is calculated based on the oracle evidence to evaluate the answer reasoning performance. Intuitively, R@1 × F1 reflects the end-to-end HQA performance.

Figure 3: The framework of MuGER², which performs multi-granularity evidence retrieval and answer reasoning. Evidence of the four granularities {column, row, cell, link} is shown in color.

Q: Which social network video platform does the program broadcast since 1992 on MTV by ViacomCBS now air?
A: Facebook Watch
Passage: "... that currently airs on Facebook Watch after airing on MTV from 1992-2017 ..."

Figure 5: An example of evidence retrieval and answer reasoning results based on different granularity evidence.

Figure 6: Statistics of the error categories.

Algorithm 1: Answer Reasoning
Input: question Q, table T, links L
Output: answer text A
1: calculate p_ret(E | Q, T, L);
2: if E-SEL(p_ret(·)) == In-Table then
3:     get the cell c_tab with score s_tab and return its value as A;
4: else
5:     get the link l_pass with score s_pass and return A = RC(Q, l_pass);

These two scores, s_tab and s_pass, indicate whether the answer is in the table or in the passages by simply comparing their values. Meanwhile, the cell and link corresponding to them are considered as the navigated evidence for the reader.

Table 1: The EM and F1 scores on HybridQA of different models.

Table 4: HQA results of different granularity evidence.