Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning

Revanth Gangi Reddy, Vikas Yadav, Md Arafat Sultan, Martin Franz, Vittorio Castelli, Heng Ji, Avirup Sil


Abstract
Research on neural IR has so far been focused primarily on standard supervised learning settings, where it outperforms traditional term matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR), a popular choice for neural IR, through synthetic data augmentation only in the source domain. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.
Anthology ID:
2022.coling-1.89
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1065–1070
URL:
https://aclanthology.org/2022.coling-1.89
Cite (ACL):
Revanth Gangi Reddy, Vikas Yadav, Md Arafat Sultan, Martin Franz, Vittorio Castelli, Heng Ji, and Avirup Sil. 2022. Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1065–1070, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning (Gangi Reddy et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.89.pdf
Data:
BioASQ, Natural Questions, TriviaQA, WebQuestions, WikiMovies