Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling

Chao-Wei Huang, Chen-An Li, Tsu-Yuan Hsu, Chen-Yu Hsu, Yun-Nung Chen


Abstract
Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. All of our source code, data, and models are available: https://github.com/MiuLab/UMR
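The abstract does not spell out how sequence likelihoods become pseudo labels, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a language model has already assigned each candidate passage a log-likelihood of the query given that passage, normalizes those scores into a soft label distribution, and measures how far a dense retriever's own distribution is from it. All numbers, the names lm_log_likelihoods and retriever_scores, and the choice of a KL-divergence objective are illustrative assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted):
    """KL(target || predicted) for two distributions over the same passages."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical log-likelihoods log p(query | passage) from a multilingual LM,
# one per candidate passage (illustrative numbers, not from the paper).
lm_log_likelihoods = [-12.4, -15.9, -13.1, -20.2]

# Hypothetical dense-retriever similarity scores for the same passages.
retriever_scores = [0.8, 1.5, 0.7, 0.1]

# Soft pseudo labels: passages under which the query is more likely get more mass.
pseudo_labels = softmax(lm_log_likelihoods)

# The retriever's current belief over the same candidates.
retriever_probs = softmax(retriever_scores)

# Assumed training signal: pull the retriever's distribution toward the pseudo labels.
loss = kl_divergence(pseudo_labels, retriever_probs)

best = max(range(len(pseudo_labels)), key=pseudo_labels.__getitem__)
print(f"pseudo-positive passage index: {best}")
print(f"distillation loss: {loss:.4f}")
```

In this toy setup the retriever initially prefers the second passage while the language model prefers the first, so the loss is positive; an iterative scheme like the two-stage framework the abstract mentions would presumably re-retrieve candidates with the updated retriever and repeat the labeling.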
Anthology ID:
2024.findings-eacl.49
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
736–746
URL:
https://aclanthology.org/2024.findings-eacl.49
Cite (ACL):
Chao-Wei Huang, Chen-An Li, Tsu-Yuan Hsu, Chen-Yu Hsu, and Yun-Nung Chen. 2024. Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling. In Findings of the Association for Computational Linguistics: EACL 2024, pages 736–746, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling (Huang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.49.pdf
Software:
 2024.findings-eacl.49.software.zip