Recovering Gold from Black Sand: Multilingual Dense Passage Retrieval with Hard and False Negative Samples

Tianhao Shen, Mingtong Liu, Ming Zhou, Deyi Xiong


Abstract
Negative samples have not been efficiently explored in multilingual dense passage retrieval. In this paper, we propose a novel multilingual dense passage retrieval framework, mHFN, to recover and utilize hard and false negative samples. mHFN consists of three key components: 1) a multilingual hard negative sample augmentation module that allows knowledge of indistinguishable passages to be shared across multiple languages and synthesizes new hard negative samples by interpolating representations of queries and existing hard negative samples, 2) a multilingual negative sample cache queue that stores negative samples from previous batches in each language to increase the number of multilingual negative samples used in training beyond the batch size limit, and 3) a lightweight adaptive false negative sample filter that uses generated pseudo labels to separate unlabeled false negative samples and converts them into positive passages in training. We evaluate mHFN on Mr. TyDi, a high-quality multilingual dense passage retrieval dataset covering eleven typologically diverse languages, and experimental results show that mHFN outperforms strong sparse, dense and hybrid baselines and achieves new state-of-the-art performance on all languages. Our source code is available at https://github.com/Magnetic2014/mHFN.
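Two of the mechanisms named in the abstract, synthesizing hard negatives by interpolating query and hard negative representations, and keeping a per-language queue of negatives from earlier batches, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). The convex mixing rule, the `alpha` parameter, the fixed queue size, and the names `interpolate` and `NegativeCache` are all assumptions made for illustration. The false negative filter is not covered here.

```python
from collections import deque
from typing import Deque, Dict, List

Vector = List[float]


def interpolate(query: Vector, hard_negative: Vector, alpha: float) -> Vector:
    """Mix a query embedding with a hard negative embedding.

    Assumed rule: alpha * query + (1 - alpha) * hard_negative.
    The abstract only says representations are interpolated; the exact
    formula used by mHFN may differ.
    """
    if len(query) != len(hard_negative):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * q + (1.0 - alpha) * h for q, h in zip(query, hard_negative)]


class NegativeCache:
    """Hypothetical per-language FIFO queue of negative embeddings."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.queues: Dict[str, Deque[Vector]] = {}

    def push(self, lang: str, negatives: List[Vector]) -> None:
        # deque(maxlen=...) evicts the oldest entries once the queue is full.
        queue = self.queues.setdefault(lang, deque(maxlen=self.max_size))
        queue.extend(negatives)

    def negatives(self, lang: str) -> List[Vector]:
        # Negatives from earlier batches that can be added to the current batch.
        return list(self.queues.get(lang, ()))


# Usage: build one synthetic negative, then cache negatives for one language.
synthetic = interpolate([1.0, 0.0], [0.0, 1.0], alpha=0.25)
cache = NegativeCache(max_size=2)
cache.push("en", [[0.5, 0.5], [2.0, 3.0], synthetic])
cached = cache.negatives("en")  # only the two most recent entries remain
```

A smaller `alpha` keeps the synthetic sample closer to the original hard negative. The bounded queue lets the number of negatives per training step exceed the batch size while memory use stays fixed.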
Anthology ID:
2022.emnlp-main.730
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10659–10670
URL:
https://aclanthology.org/2022.emnlp-main.730
DOI:
10.18653/v1/2022.emnlp-main.730
Cite (ACL):
Tianhao Shen, Mingtong Liu, Ming Zhou, and Deyi Xiong. 2022. Recovering Gold from Black Sand: Multilingual Dense Passage Retrieval with Hard and False Negative Samples. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10659–10670, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Recovering Gold from Black Sand: Multilingual Dense Passage Retrieval with Hard and False Negative Samples (Shen et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.730.pdf