Noisy Self-Training with Synthetic Queries for Dense Retrieval

Fan Jiang, Tom Drummond, Trevor Cohn


Abstract
Although existing neural retrieval models show promising results when training data is abundant, and performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To address this, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolving manner with no reliance on any external models. Experimental results show that our method consistently improves over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Further analysis in low-resource settings shows that our method is data efficient and outperforms competitive baselines with as little as 30% of the labelled training data. Extending the framework to reranker training demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains.
Anthology ID:
2023.findings-emnlp.803
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11991–12008
URL:
https://aclanthology.org/2023.findings-emnlp.803
DOI:
10.18653/v1/2023.findings-emnlp.803
Cite (ACL):
Fan Jiang, Tom Drummond, and Trevor Cohn. 2023. Noisy Self-Training with Synthetic Queries for Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11991–12008, Singapore. Association for Computational Linguistics.
Cite (Informal):
Noisy Self-Training with Synthetic Queries for Dense Retrieval (Jiang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.803.pdf