Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA

Retrieving evidence from tabular and textual resources is essential for open-domain question answering (OpenQA), as it provides more comprehensive information. However, training an effective dense table-text retriever is difficult due to the challenges of table-text discrepancy and data sparsity. To address these challenges, we introduce an optimized OpenQA Table-Text Retriever (OTTeR) that jointly retrieves tabular and textual evidence. Firstly, we propose to enhance mixed-modality representation learning via two mechanisms: modality-enhanced representation and a mixed-modality negative sampling strategy. Secondly, to alleviate the data sparsity problem and enhance general retrieval ability, we conduct retrieval-centric mixed-modality synthetic pre-training. Experimental results demonstrate that OTTeR substantially improves table-and-text retrieval performance on the OTT-QA dataset. Comprehensive analyses examine the effectiveness of all the proposed mechanisms. Besides, equipped with OTTeR, our OpenQA system achieves the state-of-the-art result on the downstream QA task, with a 10.1% absolute improvement in exact match over the previous best system. All code and data are available at https://github.com/Jun-jie-Huang/OTTeR.


Introduction
Open-domain question answering (Joshi et al., 2017; Dunn et al., 2017; Lee et al., 2019) aims to answer questions with evidence retrieved from a large-scale corpus. The prevailing solution follows a two-stage framework (Chen et al., 2017), where a retriever first retrieves relevant evidence and then a reader extracts answers from it. Existing OpenQA systems (Lee et al., 2019; Karpukhin et al., 2020; Mao et al., 2021) have demonstrated great success in retrieving and reading passages. However, most approaches are limited to questions whose answers reside in single-modal evidence, such as free-form text (Xiong et al., 2021b) or semi-structured tables (Herzig et al., 2021). Solving many real-world questions, however, requires aggregating heterogeneous knowledge (e.g., tables and passages), because massive amounts of human knowledge are stored in different modalities. In the example shown in Figure 1, the supporting evidence for the given question resides in both the table and related passages. Therefore, retrieving relevant evidence from heterogeneous knowledge resources involving tables and passages is essential for advanced OpenQA, and is the focus of this paper.
There are two major challenges in joint table-and-text retrieval: (1) there exists a discrepancy between tables and text, which makes it difficult to jointly retrieve heterogeneous knowledge and model its cross-modality connections; (2) the data sparsity problem is severe, because training a joint table-text retriever requires large-scale supervised data covering all targeted areas, which is laborious and impractical to obtain. In light of these two challenges, we introduce an optimized OpenQA Table-Text Retriever, dubbed OTTER, which utilizes mixed-modality dense representations to jointly retrieve tables and text. Firstly, to model the interaction between tables and text, we propose to enhance mixed-modality representation learning via two novel mechanisms: modality-enhanced representations (MER) and mixed-modality hard negative sampling (MMHN). MER incorporates fine-grained representations of each modality to enrich the semantics. MMHN utilizes table structures and creates hard negatives by substituting fine-grained key information in the two modalities, encouraging better discrimination of relevant evidence. Secondly, to alleviate the data sparsity problem and equip the model with general retrieval ability, we propose a retrieval-centric pre-training task on a large-scale synthesized corpus, which is constructed by automatically synthesizing mixed-modal evidence and reversely generating questions with a BART-based generator.
Our primary contributions are three-fold: (1) We propose three novel mechanisms to improve table-and-text retrieval for OpenQA, namely modality-enhanced representation, a mixed-modality hard negative sampling strategy, and mixed-modality synthetic pre-training. (2) Evaluated on OTT-QA, OTTER substantially improves retrieval performance compared with baselines; extensive experiments and analyses further examine the effectiveness of the above three mechanisms. (3) Equipped with OTTER, our OpenQA system significantly surpasses previous state-of-the-art models, with a 10.1% absolute improvement in terms of exact match.

Problem Formulation
The task of OpenQA over tables and text is defined as follows. Given a corpus of tables $C_T = \{t_1, \ldots, t_T\}$ and a corpus of passages $C_P = \{p_1, \ldots, p_P\}$, the task aims to answer a question $q$ by extracting the answer $a$ from the knowledge resources $C_T$ and $C_P$. The standard system for this task involves two components: a retriever that first retrieves relevant evidence $c \subset C_T \cup C_P$, and a reader that extracts $a$ from the retrieved evidence set.

Table-and-text Retrieval
In this paper, we focus on table-and-text retrieval for OpenQA. To better align the mixed-modality information in table-and-text retrieval, we follow Chen et al. (2021) and take a table-text block as the basic retrieval unit, which consists of a table segment and relevant passages. Compared with retrieving a single table or passage, retrieving table-text blocks provides more clues for retrievers, since single-modal data often contain incomplete context. Figure 2 illustrates table-and-text retrieval and our overall system.

Table-Text Block
Since relevant tables and passages do not necessarily coexist naturally, we need to construct table-text blocks before retrieval. One observation is that tables often contain large numbers of entities and events. Based on this observation and prior work (Chen et al., 2020a), we apply entity linking to group the heterogeneous data. Specifically, we use BLINK (Ledell et al., 2020) to fuse tables and text; BLINK is an effective entity linker capable of linking against all Wikipedia entities and their corresponding passages. Given a flat table segment, BLINK returns l relevant passages linked to the entities in the table. However, as table size and passage quantity grow, the input may become too long for BERT-based encoders (Devlin et al., 2019). Thus, we split a table into several segments to limit the number of input tokens, such that each segment contains only a single row. This setup is a trade-off to satisfy the input limit, but our approach is scalable to full tables when input capacity permits. More details about table-text blocks can be found in Appendix A.1.
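For illustration, here is a minimal sketch of this row-level block construction, assuming an upstream entity linker (e.g., BLINK) has already produced a mapping from cell mentions to their linked passages; all names and the dict layout are ours, not from the OTTeR codebase.

```python
# A minimal sketch of row-level table-text block construction. The
# `linked_passages` mapping is a hypothetical stand-in for entity-linker
# output (cell mention -> linked Wikipedia passage).
from typing import Dict, List


def build_table_text_blocks(table: Dict, linked_passages: Dict[str, str]) -> List[Dict]:
    """Split a table into single-row segments and attach linked passages."""
    blocks = []
    for row in table["rows"]:
        # Collect passages for every cell in this row that was linked.
        passages = [linked_passages[cell] for cell in row if cell in linked_passages]
        blocks.append({
            "title": table["title"],
            "section_title": table["section_title"],
            "header": table["header"],
            "row": row,            # a single table row per block
            "passages": passages,  # textual evidence fused with the row
        })
    return blocks
```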

The Dual-Encoder Architecture
The prevailing choice for dense retrieval is the dual-encoder method. In this framework, a question $q$ and a table-text block $b$ are separately encoded into two $d$-dimensional vectors by a neural encoder $E(\cdot)$. The relevance between $q$ and $b$ is then measured by the dot product of the two vectors:

$$\mathrm{sim}(q, b) = E(q)^{\top} E(b) \quad (1)$$

The benefit of this method is that all table-text blocks can be pre-encoded into vectors to support indexed search at inference time. In this work, we initialize the encoder with pre-trained RoBERTa (Liu et al., 2019) and take the representation of the first [CLS] token as the encoded vector.
When an incoming question is encoded, approximate nearest neighbor search can be leveraged for efficient retrieval (Johnson et al., 2021).
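A minimal sketch of such indexed search with FAISS (the library described by Johnson et al., 2021) is shown below; the dimensionality and random vectors are placeholders for the actual encoder outputs.

```python
# A minimal sketch of nearest neighbor search over pre-encoded block vectors
# with FAISS. Random vectors stand in for E(b) and E(q) from the dual encoder.
import faiss
import numpy as np

d = 768                                                  # vector dimension
block_vecs = np.random.rand(10000, d).astype("float32")  # stand-in for E(b)

index = faiss.IndexFlatIP(d)   # exact inner-product index, matching Eq. (1)
index.add(block_vecs)          # index all pre-encoded blocks offline

query_vec = np.random.rand(1, d).astype("float32")       # stand-in for E(q)
scores, block_ids = index.search(query_vec, 100)         # top-100 blocks
```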
Training The training objective is to learn representations that maximize the relevance between a question and its gold table-text block. We follow Karpukhin et al. (2020) to learn the representations. Formally, given a training set of $N$ instances, where the $i$-th instance $(q_i, b_i^+, \{b_{i,j}^-\}_{j=1}^m)$ consists of a question $q_i$, a positive block $b_i^+$, and $m$ negative blocks $\{b_{i,j}^-\}_{j=1}^m$, we minimize the cross-entropy loss:

$$L(q_i, b_i^+, \{b_{i,j}^-\}_{j=1}^m) = -\log \frac{e^{\mathrm{sim}(q_i, b_i^+)}}{e^{\mathrm{sim}(q_i, b_i^+)} + \sum_{j=1}^{m} e^{\mathrm{sim}(q_i, b_{i,j}^-)}}$$

The $m$ negatives consist of one hard negative and $m-1$ in-batch negatives taken from the other instances in the mini-batch.
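This loss can be implemented compactly with in-batch negatives. Below is a minimal PyTorch sketch, assuming each mini-batch carries the encoded questions, their positive blocks, and one explicit hard negative per question; the function name and tensor layout are illustrative, not taken from the OTTeR codebase.

```python
# A minimal sketch of the in-batch negative cross-entropy loss in the style of
# Karpukhin et al. (2020). q_vecs, pos_vecs, hard_neg_vecs are [B, d] tensors
# produced by the dual encoder.
import torch
import torch.nn.functional as F

def retrieval_loss(q_vecs, pos_vecs, hard_neg_vecs):
    # Candidates: B positives followed by B hard negatives -> [2B, d].
    cand_vecs = torch.cat([pos_vecs, hard_neg_vecs], dim=0)
    # Dot-product similarity of every question to every candidate -> [B, 2B].
    scores = q_vecs @ cand_vecs.t()
    # The gold block for question i is candidate i; the remaining 2B - 1
    # candidates serve as in-batch and hard negatives.
    labels = torch.arange(q_vecs.size(0), device=q_vecs.device)
    return F.cross_entropy(scores, labels)
```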

Modality-enhanced Representation
Most dense retrievers use a coarse-grained single-modal representation, either the representation of the [CLS] token or the averaged representations of all tokens (Zhan et al., 2020; Huang et al., 2021). To enrich the semantics of a mixed-modal block, we introduce modality-enhanced representations (MER), which incorporate fine-grained representations of the tabular and textual modalities ($h_{table}$ and $h_{text}$); the modality-enhanced representation of a block is $b = [h_{[CLS]}; h_{table}; h_{text}]$, where ; denotes concatenation. Given the tokens in the tabular/textual modality, we calculate its representation in one of the following ways: (1) FIRST: the representation of the beginning special token (i.e., [TAB] or [PSG]); (2) AVG: the averaged token representations; (3) MAX: max pooling over token representations; (4) SelfAtt: a weighted average over token representations, where the weights are computed by a self-attention layer. We discuss the impact of different types of MER in § 5.3. Our best model adopts FIRST as the final setting. To match the dimensionality of the enriched block representation, we represent the question by replicating the encoded question representation.
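A minimal sketch of the four pooling strategies is shown below, assuming `hidden` holds the encoder's token representations for one block, `start`/`end` delimit the tokens of one modality, and `self_att` is a small scoring layer (e.g., a `torch.nn.Linear(d, 1)`); these names are ours, not from the released implementation.

```python
# A minimal sketch of the MER pooling strategies (FIRST / AVG / MAX / SelfAtt).
import torch

def modality_representation(hidden, start, end, strategy="FIRST", self_att=None):
    """hidden: [seq_len, d]; returns a single [d] vector for one modality."""
    tokens = hidden[start:end]                    # [n, d] tokens of the modality
    if strategy == "FIRST":                       # beginning special token
        return tokens[0]
    if strategy == "AVG":                         # mean pooling
        return tokens.mean(dim=0)
    if strategy == "MAX":                         # max pooling
        return tokens.max(dim=0).values
    if strategy == "SelfAtt":                     # learned weighted average
        weights = torch.softmax(self_att(tokens).squeeze(-1), dim=0)  # [n]
        return weights @ tokens
    raise ValueError(strategy)

# The block representation concatenates [CLS] with both modality vectors:
# b = torch.cat([hidden[0], h_table, h_text], dim=-1)   # [3d]
```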

Mixed-modality Hard Negative Sampling
Prior studies (Nogueira and Cho, 2019; Gillick et al., 2019) have found that hard negative sampling is essential for training a dense retriever. These methods take each piece of evidence as a whole and retrieve the most similar irrelevant one as the hard negative. Instead of finding an entirely irrelevant block, we propose a mixed-modality hard negative sampling mechanism, which constructs more challenging hard negatives by substituting only partial information in the table or text.
Formally, suppose a positive block $b_j^+ = (t_j, p_j)$ is built from the $j$-th row of a table; the answer $a$ resides either in the table segment $t_j$ or in the passages $p_j$. We replace either the table row or the passages, depending on where the answer exists. If $a$ exists in the table row, we construct a hard negative $b_j^- = (t_k, p_j)$ by replacing $t_j$ with a random row $t_k$ from the same table. Similarly, if $a$ resides in the passages, we create a hard negative $b_j^- = (t_j, p_k)$ by replacing the passages with $p_k$ from another block.
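A minimal sketch of this sampling procedure, reusing the hypothetical block layout from the earlier construction sketch, could look as follows; the answer-containment test is a simplification of whatever span matching the actual pipeline uses.

```python
# A minimal sketch of mixed-modality hard negative sampling: replace only the
# modality that holds the answer, keeping the other modality intact.
import random

def mixed_modality_hard_negative(block, answer, table_rows, other_blocks):
    neg = dict(block)
    if answer in " ".join(block["row"]):
        # Answer is in the table: swap in a random different row of the same
        # table, keep the passages.
        candidates = [r for r in table_rows if r != block["row"]]
        neg["row"] = random.choice(candidates)
    else:
        # Answer is in the passages: keep the row, swap in passages taken
        # from another block.
        donor = random.choice(other_blocks)
        neg["passages"] = donor["passages"]
    return neg
```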

Mixed-modality Synthetic Pre-training
To alleviate the data sparsity issue, we propose a mixed-modality synthetic pre-training (MMSP) task. MMSP enhances retrieval ability by pre-training on a large-scale synthesized corpus of mixed-modality pseudo training data, i.e., (question, table-text block) pairs. We construct the pseudo training corpus in two steps: table-text block mining and question back-generation.
(1) Mine relevant table-text pairs. One observation is that Wikipedia hyperlinks often connect explanatory passages to entities in tables, providing high-quality relevant table-text pairs. Based on this, we regard Wikipedia as an excellent resource for mining table-text pairs. Specifically, we select a row in a table and find the corresponding passages through hyperlinks to form a fused table-text block. We keep only the first section of each Wikipedia page, since it usually contains the most important information about the linked entity. (2) Write pseudo questions for fused blocks. The questions are expected not only to contain the mixed-modality information from the blocks, but also to be fluent and natural. Therefore, instead of using template-based synthesis, we use a generation-based method, called back-generation, to derive more fluent and diverse questions. Specifically, we use BART-base (Lewis et al., 2020) as the backbone of our generator, fine-tuned on oracle pairs of (question, table-text block) from the OTT-QA training set. The input to the generator is a sequence of the flattened table and linked passages, and the output is a mixed-modality question. In this way, we automatically obtain a synthesized pre-training corpus with 3M pairs of table-text blocks and pseudo questions. We present examples of generated pseudo questions in Appendix A.2.
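A minimal sketch of the back-generation step with Hugging Face Transformers is shown below; the fine-tuned checkpoint path and decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of question back-generation with a BART-base generator
# fine-tuned on oracle (question, table-text block) pairs. The checkpoint
# path "path/to/finetuned-bart" is hypothetical.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("path/to/finetuned-bart")

# The input is the flattened table segment plus its linked passages.
block_text = ("[TITLE] 1920 Summer Olympics [DATA] Venue is Antwerp Zoo . "
              "[PSG] Antwerp Zoo is a zoo in the centre of Antwerp ...")
inputs = tokenizer(block_text, return_tensors="pt",
                   truncation=True, max_length=512)

# Beam settings here are illustrative.
question_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(question_ids[0], skip_special_tokens=True))
```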
During pre-training, we adopt a similar ranking task, where the training objective is the same as described in § 3.1. As for negative sampling, we use in-batch negatives and one hard negative randomly sampled from the same table.

Experiment Settings
In this section, we describe the experiment settings for open-domain question answering over tables and text, and report the performance of our system on table-and-text retrieval and downstream question answering.

Dataset
Our system is evaluated on the OTT-QA dataset (Chen et al., 2021), a large-scale open-domain table-text question answering benchmark. Statistics of OTT-QA and the table-text corpus are shown in Table 1.

Evaluation Metrics
Both recall and accuracy can reflect retrieval performance and are commonly used in IR. In this paper, we use recall rather than precision-based metrics, since recall is widely adopted by recent retrievers such as MVR (Zhang et al., 2022b) and RocketQA (Qu et al., 2021). Recall@k measures whether answers are captured in the top-k results, and thus whether they can be fed to QA models as input; it is computed as the proportion of relevant items found among the top-k returned items.
In this paper, we use two metrics to evaluate the retrieval system: table recall and table-text block recall. Table recall indicates whether the top-k retrieved blocks come from the ground-truth table. However, in table-and-text retrieval, table recall is an imperfect, coarse-grained metric, since our basic retrieval unit is a table-text block corresponding to a specific row in the table. Therefore, we propose a more fine-grained and challenging metric, table-text block recall at top-k ranks, where a fused block is considered a correct match when it meets two requirements: first, it comes from the ground-truth table; second, it contains the correct answer. On the downstream QA task, we report the exact match (EM) and F1 scores (Chen et al., 2021) to evaluate the OpenQA system.
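As an illustration, block recall at top-k can be computed per question as follows; the field names are assumptions about how retrieved blocks are stored, not the evaluation script's actual interface.

```python
# A minimal sketch of the table-text block recall metric for one question.
from typing import Dict, List

def block_recall_at_k(retrieved: List[Dict], gold_table_id: str,
                      answer: str, k: int) -> bool:
    """True if any top-k block comes from the gold table AND contains the answer."""
    for block in retrieved[:k]:
        if block["table_id"] == gold_table_id and answer in block["text"]:
            return True
    return False

# Corpus-level Recall@k is the fraction of questions for which this holds.
```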

Experiments: Table-and-Text Retrieval
In this section, we evaluate the retrieval performance of our OpenQA Table-Text Retriever (OTTER). We first compare OTTER with previous retrieval approaches on OTT-QA. Then we conduct extensive experiments to examine the effectiveness of the three proposed mechanisms.

Experiment Settings
Baseline Methods We compare with the following eight retrievers. (1) BM25 w/o text (Chen et al., 2021) is a sparse method that retrieves tabular evidence, where the flat table with metadata (i.e., table title and section title) and content is used for retrieval. (2) Iterative Retriever (Chen et al., 2021) is a dense retriever that iteratively retrieves tables and passages in three steps. (3) Fusion Retriever (Chen et al., 2021) is the only existing dense method that retrieves table-text blocks; it uses GPT-2 (Radford et al., 2019) to link passages and the Inverse Cloze Task (Lee et al., 2019) to pre-train the encoder. We directly report the results of the above three models from the original papers. As the only reported retrieval performance of the Iterative/Fusion Retriever is Hit@4K (whether the answer exists in the retrieved 4096 tokens), for fair comparison we also report Hit@4K for our implemented models. (4) Bi-Encoder and (5) Tri-Encoder (Kostić et al., 2021) are two dense retrievers that encode question, table, and text with a shared or separate BERT, where the former uses a shared encoder for table and text, and the latter uses three encoders for the input. As they do not release code and models, we report results reproduced in our setting. As there are few works on table-text retrieval, we also implement the following baselines that retrieve from the same corpus as OTTER. (6) BM25 is a sparse method that retrieves table-text blocks. (7) OTTER-baseline is a dense retriever for table-text blocks trained with random negatives, without MER and pre-training. (8) OTTER w/o text is a dense retriever for table evidence only; we remove textual passages from the corpus during retrieval.

Implementation Details We use RoBERTa-base (Liu et al., 2019) as the backbone of our retrievers, since RoBERTa is an improved version of BERT (Devlin et al., 2019). The retrievers take as input at most 512 tokens per table-text block and 70 tokens per question, and are trained with in-batch negatives plus one additional hard negative for both pre-training and fine-tuning. In the pre-training stage, we pre-train on the synthesized corpus for 5 epochs on 8 Nvidia Tesla V100 32GB GPUs with a batch size of 168, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 3e-5 and linear scheduling with 5% warm-up. In the fine-tuning stage, we train all retrievers for 20 epochs with a batch size of 64, a learning rate of 2e-5, and a warm-up ratio of 10% on 8 Nvidia Tesla V100 16GB GPUs.

Main Results
Table 2 compares different retrievers on the OTT-QA dev set, using table recall at top-k ranks (k ∈ {1, 10, 20, 50, 100}), because results from other papers are mainly reported as table recall. We find that: (1) OTTER significantly outperforms previous sparse and dense retrievers, and the gap is especially large when k is small (e.g., an 8.2% absolute gain for R@10), which demonstrates the effectiveness of OTTER; (2) when textual passages are removed during retrieval (OTTER w/o text), the performance of OTTER drops dramatically, especially when k is small. This phenomenon shows the importance of taking textual information as a complement to tables.

Ablation Study
To examine the effectiveness of the three mechanisms in OTTER, we conduct extensive ablation studies on OTT-QA and discuss our findings below.

Effect of Modality-enhanced Representation
In this experiment, we explore the effect of modality-enhanced representations (MER) on retrieval performance. Table 3 reports the table recall and block recall of our models with different MER strategies on the OTT-QA dev set. We also report the result after eliminating MER, i.e., using only the representation of the [CLS] token for ranking. We find that integrating modality-enhanced representations significantly improves retrieval performance. As MER incorporates single-modal representations to enrich the mixed-modal representation, retrievers can more easily capture the comprehensive semantics of table-text blocks. In addition, among all MER strategies, FIRST, which uses the representation of the beginning special token of each modality, achieves the best performance, verifying its stronger representational ability compared with the other pooling strategies.

Effect of Mixed-modality Negative Sampling
To investigate the effectiveness of hard negative sampling on retrieval, we evaluate our system on the OTT-QA development set under the hard negative sampling settings listed in Table 4, including MMHN and a simpler strategy that samples a table-text block in the same table containing no answer.
From the results shown in Table 4, we observe that training the retriever with MMHN yields the best performance among the hard negative sampling strategies. Since mixed-modality hard negatives are constructed by replacing only partial information in the positive block, they are more challenging and enable the retriever to better distinguish the important information in the evidence.

Effect of Mixed-modality Synthetic Pre-training
We investigate the effectiveness of mixed-modality synthetic pre-training by first pre-training the retriever and then fine-tuning it on the OTT-QA training set. The pre-training corpus consists of 3 million (question, evidence) pairs, with questions synthesized in the following ways: (1) BartQ: the questions are generated by BART as described in § 3.4; (2) TitleQ: the questions are constructed from passage titles and table titles; (3) DA w/o PT: data augmentation without pre-training, where we merge the BART synthetic corpus with the oracle data for fine-tuning; (4) w/o PT: direct fine-tuning without pre-training.
The retrieval results on the dev set of OTT-QA are shown in Table 5. We find that: (1) pre-training brings a substantial performance gain to dense retrieval, showing the benefit of automatically synthesizing a large-scale pre-training corpus to improve retrievers; (2) synthesizing questions with the BART-based generator performs better than the template-based method (TitleQ), which we attribute to the more fluent and diverse questions produced by generation; (3) using the synthesized corpus for data augmentation performs much worse than using it for pre-training, and even worse than direct fine-tuning without pre-training. One explanation is that pre-training aims to equip the model with a general retrieval ability beforehand, while fine-tuning aims to learn a more specific and accurate retriever. As the synthesized corpus is noisier, using it as augmented fine-tuning data may make training unstable and degrade performance. This observation again verifies the effectiveness of pre-training with the mixed-modality synthetic corpus.

Case Study
Here, we give an example of retrieved evidence to show that OTTER correctly represents questions and blocks with the three proposed strategies. As shown in Figure 4, to answer the question, the model must find relevant table-text blocks with two pieces of evidence distributed across tables and passages: the "skier who won 6 gold medals at the FIS Nordic Junior World Ski Championships" and the "year when the skier started competing". OTTER successfully returns a correct table-text block at rank 1, which includes all the necessary information. The top-2 block retrieved by OTTER is also reasonable, since partial evidence such as 6 gold medals and Ski Championships is matched. In contrast, OTTER-baseline (without the three mechanisms) returns an unsatisfactory block: although it finds the Ski Championships, a strong signal for locating the table, it fails to capture fine-grained information such as 6 gold medals and the starting year.

This case demonstrates that OTTER captures more accurate meanings of fused table-text blocks, especially when the supporting information resides in separate modalities. It shows that enhancing cross-modal representations with the proposed mechanisms is beneficial for modeling heterogeneous data.

Experiments: Question Answering
In this section, we examine how OTTER affects downstream QA performance.

Reader
We implement a two-stage open-domain question answering system equipped with OTTER as the retriever and a reader model that extracts the answer from the retrieved evidence. As we mainly focus on improving the retriever in this paper, we use a state-of-the-art reader model to evaluate downstream QA performance. Following Chen et al. (2021), we use the Cross Block Reader (CBR) to extract answers. The CBR jointly reads the concatenated top-k retrieved table-text blocks and outputs the best answer span from these blocks. In contrast to Single Block Readers (SBR), which read only one block at a time, CBR is more powerful in utilizing the cross-attention mechanism to model cross-block dependencies. We take the pre-trained Long-Document Transformer (Longformer) (Beltagy et al., 2020) as the backbone of CBR, which applies a sparse attention mechanism and accepts longer input sequences of up to 4,096 tokens. For fair comparison with Chen et al. (2021), we feed the top-15 retrieved blocks into the reader model for inference. To balance the distributions of training and inference data, we also take k table-text blocks for training, which contain several ground-truth blocks, with the rest being retrieved blocks. The training objective is to maximize the marginal log-likelihood of all correct answer spans in the positive blocks. The reader is trained on 8 Nvidia V100 GPUs for 5 epochs with a batch size of 16 and a learning rate of 1e-5.
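To make the objective concrete, here is a minimal sketch of a marginal log-likelihood over answer spans for an extractive reader; the actual CBR implementation may differ in details such as batching and block masking, and the function name is ours.

```python
# A minimal sketch of the marginal log-likelihood over correct answer spans,
# assuming start/end logits from a Longformer-style extractive reader and a
# list of gold (start, end) positions in the concatenated blocks.
import torch

def marginal_span_loss(start_logits, end_logits, gold_spans):
    """start_logits, end_logits: [seq_len]; gold_spans: list of (s, e)."""
    log_p_start = torch.log_softmax(start_logits, dim=-1)
    log_p_end = torch.log_softmax(end_logits, dim=-1)
    # Log-probability of each gold span: log p(start) + log p(end).
    span_log_probs = torch.stack(
        [log_p_start[s] + log_p_end[e] for s, e in gold_spans])
    # Marginalize over all correct spans, then negate for the loss.
    return -torch.logsumexp(span_log_probs, dim=0)
```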

Results
The results are shown in Table 6. OTTER + CBR significantly outperforms existing OpenQA systems, with a 10.1% EM gain on the test set over the prior state-of-the-art system and a 10.0% EM gain on the dev set over OTTER-baseline + CBR. These results demonstrate that our three proposed mechanisms retrieve evidence that better supports the question, which in turn improves downstream QA performance.
To further analyze the effect of OTTER's components on QA performance, we conduct an ablation study on OTT-QA after eliminating each component. As shown in Figure 5, the OpenQA system with the full OTTER achieves the best performance, and removing any component leads to a substantial performance drop. This verifies the effectiveness of the three proposed mechanisms, i.e., modality-enhanced representations (MER), mixed-modality hard negatives (MMHN), and mixed-modality synthetic pre-training. We also evaluate the impact of the number of retrieved blocks used for inference. As shown in Figure 5, the EM score increases rapidly with k when k < 20, but the growth slows when k > 20, which helps find a better trade-off between efficiency and performance.
Related Work

Our approach differs from existing methods mainly in two aspects: the targeted evidence source and the mixed-modality learning mechanisms. First, we retrieve mixed-modality evidence from both tabular and textual corpora, unlike text-based retrievers (Karpukhin et al., 2020; Asai et al., 2020; Xiong et al., 2021b; Xu et al., 2022) and table-based retrievers (Chen et al., 2020b; Shraga et al., 2020; Pan et al., 2021a). Second, our three mixed-modality learning mechanisms also differ from existing methods. For mixed-modality representation, previous work (Karpukhin et al., 2020) mainly uses the single representation of the special token for ranking, whereas our method incorporates single-modal representations to enrich the mixed-modal representation. For mixed-modality negative sampling, instead of finding an entire negative evidence with either sparse or dense methods (Yang et al., 2021; Luan et al., 2021; Lu et al., 2020; Xiong et al., 2021a; Lu et al., 2021; Zhan et al., 2021; Zhang et al., 2022a; Xiao et al., 2022), we construct more challenging hard negatives by replacing only partial single-modality information at a time. For mixed-modality synthetic pre-training, our strategy differs in the pre-training task, the knowledge source, and the method of synthesizing pseudo questions. There is also work on joint pre-training over tables and text (Herzig et al., 2020; Eisenschlos et al., 2020; Yin et al., 2020; Oguz et al., 2022); however, these methods mainly take table metadata as the source of text and do not consider the retrieval task. Instead, we use linked passages as a more reliable knowledge source and target retrieval-based pre-training. Some attempts incorporate pre-training tasks to improve retrieval performance (Chang et al., 2020; Sachan et al., 2021; Oguz et al., 2022; Wu et al., 2022), but they target textual-domain retrieval or use template-based query construction. In contrast, our approach focuses on the more challenging setting of retrieving evidence from tabular and textual corpora and adopts a generation-based query synthesis method. Besides, Pan et al. (2021b) explore generating multi-hop questions for tables and text, but in an unsupervised manner.

Conclusion
In this paper, we propose an optimized dense retriever, OTTER, to retrieve joint table-text evidence for OpenQA. OTTER involves three novel mechanisms to address the table-text discrepancy and data sparsity challenges: modality-enhanced representations, mixed-modality hard negative sampling, and mixed-modality synthetic pre-training. We experiment on the OTT-QA dataset and evaluate on two subtasks, retrieval and QA. Results show that OTTER outperforms other retrieval methods by a large margin, which further leads to a substantial absolute gain of 10.1% EM on downstream QA. Extensive experiments illustrate the effectiveness of all three mechanisms in improving retrieval and QA performance, and further analyses show OTTER's ability to retrieve more relevant evidence from heterogeneous knowledge resources.

Limitations
Firstly, the mixed-modality input suffers from the input length problem. As tables can be large, the set of linked passages can also be large, exceeding the maximum input length of PLMs. We therefore make a trade-off and break each table into rows, which prevents answering questions that require information across multiple rows. Secondly, OTT-QA is a fairly limited dataset, in which answering each question must require information from both modalities. It is thus unclear how the model performs on single-modal OpenQA, where one modality may not be required, and whether incorporating tabular information helps textual OpenQA. Thirdly, as the number of in-batch negatives heavily influences retrieval performance, training a strong dense retriever requires large GPU resources; we use 8 Nvidia Tesla V100 16GB GPUs for each experiment.

A.1 Table-Text Block Representation
The table-text block representation is illustrated in Figure 6. Following Chen et al. (2021), we include the title and section title of a table and prefix them to the table cells. We also flatten each column name and cell value with an "is" token to obtain a more natural and fluent utterance. In addition, we add different special tokens to separate the segments: [TAB] for the table segment, [PSG] for the passage segment, [TITLE] for the table title, [SECTITLE] for the section title, [DATA] for the table content, and [SEP] to separate different passages. Such a flattened block is used throughout this paper as the input string to both the retriever and the reader.
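A minimal sketch of this flattening is given below, using the special tokens above; the exact token ordering and the dict layout (reused from the earlier construction sketch) approximate Figure 6 rather than reproduce the released implementation.

```python
# A minimal sketch of flattening a table-text block into the retriever/reader
# input string, with "column is value" phrasing and the paper's special tokens.
def flatten_block(block) -> str:
    row_text = " . ".join(
        f"{col} is {val}" for col, val in zip(block["header"], block["row"]))
    passages = " [SEP] ".join(block["passages"])
    return (f"[TAB] [TITLE] {block['title']} [SECTITLE] {block['section_title']} "
            f"[DATA] {row_text} [PSG] {passages}")
```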
In the OTT-QA dataset, long rows frequently appear in tables, which leads to more entities and passages in a single table-text block. To keep the most relevant information in a block, we rank the passages by their TF-IDF similarity to the table schema and content, and truncate tokens when a flattened block exceeds the input length limit of the RoBERTa tokenizer.

B.2 Entity Linking
To understand the effect of entity linking, we evaluate both standalone entity linking accuracy and retrieval performance. We consider the following linking models: (1) GPT-2, used in Chen et al. (2021), which first augments the cell value with context generated by GPT-2 (Radford et al., 2019) and then uses BM25 to rank blocks against the augmented form; (2) BLINK (Ledell et al., 2020), used in OTTER, which applies a bi-encoder ranker and a cross-encoder re-ranker to link Wikipedia passages to the entities in flattened tables; (3) an oracle linker, which uses the passages originally linked in the table.
We evaluate entity linking on the OTT-QA dev set following the settings in Chen et al. (2021) and report the table-segment-wise F1 score. Table 8 shows the performance. The F1 score of BLINK is higher than that of GPT-2, which leads to more relevant passages for the tables.

Figure 1: An example of open-domain question answering over tables and text. Highlighted phrases in the same color indicate evidence pieces related to the question in each single modality. The answer is marked in red.

Figure 2: The framework of the overall OpenQA system. It first jointly retrieves the top-k table-text blocks with our OTTER, and then answers the question from the retrieved evidence with a reader model.


Table 1: Statistics of OTT-QA and the table-text corpus.

Table 2: Results on the OTT-QA dev set. Table recall and Hit@4K (Chen et al., 2021) are reported, where Hit@4K measures whether the answer exists in the retrieved 4096 subword tokens. * denotes results reproduced by us.


Table 3: Retrieval performance of OTTER under different modality-enhanced representation (MER) settings.

Table 5: Retrieval performance of OTTER under different settings. PT denotes pre-training.

Table 6: QA results on the OTT-QA dev set and blind test set.

Retriever + Reader                                Dev EM  Dev F1  Test EM  Test F1
BM25 + HYBRIDER (Chen et al., 2020a)                10.3    13.0     9.7     12.8
BM25 + DUREPA (Li et al., 2021)                     15.8      -       -        -
Iterative Retriever + SBR (Chen et al., 2021)        7.9    11.1     9.6     13.1
Fusion Retriever + SBR (Chen et al., 2021)          13.8    17.2    13.4     16.9
Iterative Retriever + CBR (Chen et al., 2021)       14.4    18.5    16.9     20.9
Fusion Retriever + CBR (Chen et al., 2021)          28…
Figure 4: Examples of table-text blocks returned by the full OTTER and by OTTER without modality-enhanced representations. Words of the same color in the retrieved blocks denote the evidence corresponding to the question. Q: The skier with 6 gold medals at the FIS Nordic Junior World Ski Championships started competing in what year? A: 2000

Figure 6: An example of a flattened table-text block: "… Venue is Antwerp Zoo. Sports is Boxing, Wrestling. Capacity is Not listed. [PSG] Antwerp Zoo is a zoo in the centre of Antwerp …. [SEP] These are the results of the boxing competition at the 1920 …. [SEP] At the 1920 Summer Olympics, ten wrestling events were contested…."

To provide a better understanding of mixed-modality synthetic pre-training, we give some examples of pseudo training data with (question, table-text block) pairs in Table 7. As we can see, the generated questions are not only fluent and natural, but also incorporate mixed-modality information from tables and passages.

Here, we show the detailed retrieval results of OTTER with different components in Figure 7. The table recall and block recall at top-k ranks are reported. The full OTTER substantially surpasses the models under other settings in block recall, and in table recall when k ≤ 50.

Figure 7: Retrieval performance of retrievers on the dev set of OTT-QA. The full OTTER substantially surpasses the other models in block recall, and in table recall when k ≤ 50.