Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

Wenhan Xiong, Hong Wang, William Yang Wang


Abstract
Commonly used information retrieval methods such as TF-IDF in open-domain question answering (QA) systems are insufficient to capture deep semantic matching that goes beyond lexical overlap. Some recent studies treat the retrieval process as maximum inner product search (MIPS) over dense question and paragraph representations, achieving promising results on several information-seeking QA datasets. However, pretraining the dense vector representations is highly resource-demanding: for example, it requires a very large batch size and many training steps. In this work, we propose a sample-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we use an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a simple progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three open-domain QA datasets, our method consistently outperforms a strong dense retrieval baseline that uses 6 times more computation for training. On two of the datasets, our method achieves an absolute improvement of more than 4 points in answer exact match.
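As an illustration of the retrieval formulation mentioned in the abstract, the following is a minimal sketch of exhaustive maximum inner product search over dense vectors. It is not taken from the paper or its code release: the function names (inner_product, mips) and the toy three-dimensional vectors are hypothetical, and a real system would use encoder outputs and an approximate search index rather than a linear scan.

```python
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def mips(
    question_vec: Sequence[float],
    paragraph_vecs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the largest inner product.

    This is an exhaustive scan over all paragraphs, used here only to
    show the scoring rule; it does not scale to a full corpus.
    """
    scored = [
        (i, inner_product(question_vec, p)) for i, p in enumerate(paragraph_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-written vectors (hypothetical stand-ins for encoder outputs).
paragraphs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.1, 0.9],
]
question = [0.2, 0.9, 0.1]

top = mips(question, paragraphs, k=2)
for index, score in top:
    print(f"paragraph {index}: score {score:.2f}")
```

In this toy setup the second paragraph vector points in nearly the same direction as the question vector, so it receives the highest score. Pretraining of the kind described in the abstract aims to make the encoders produce vectors for which this ranking reflects semantic relevance rather than lexical overlap.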
Anthology ID:
2021.eacl-main.244
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Editors:
Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2803–2815
URL:
https://aclanthology.org/2021.eacl-main.244
DOI:
10.18653/v1/2021.eacl-main.244
Cite (ACL):
Wenhan Xiong, Hong Wang, and William Yang Wang. 2021. Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2803–2815, Online. Association for Computational Linguistics.
Cite (Informal):
Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering (Xiong et al., EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.244.pdf
Code:
xwhan/ProQA
Data:
Natural Questions, WebQuestions