Exploring efficient zero-shot synthetic dataset generation for Information Retrieval

Tiago Almeida, Sérgio Matos


Abstract
The broad integration of neural retrieval models into Information Retrieval (IR) systems is significantly impeded by the high cost and laborious process of manually labelling training data. Synthetic training data generation, a potential workaround, often requires expensive computational resources because it relies on large language models. This work explores the potential of small language models for efficiently creating high-quality synthetic datasets to train neural retrieval models. We aim to identify an optimal method for generating synthetic datasets, enabling the training of neural reranking models on document collections where annotated data is unavailable. We introduce a novel methodology, grounded in the principles of information theory, to select the most appropriate documents to be used as context for question generation. Then, we employ a small language model for zero-shot conditional question generation, supplemented by a filtering mechanism to ensure the quality of the generated questions. Extensive evaluation on five datasets demonstrates the potential of our approach, which outperforms unsupervised retrieval methods such as BM25 and pretrained monoT5. Our findings indicate that an efficiently generated “silver-standard” dataset allows effective training of neural rerankers in unlabeled scenarios. To ensure reproducibility and facilitate wider application, we will release a code repository featuring an accessible API for zero-shot synthetic question generation.
Anthology ID:
2024.findings-eacl.81
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1214–1231
URL:
https://aclanthology.org/2024.findings-eacl.81
Cite (ACL):
Tiago Almeida and Sérgio Matos. 2024. Exploring efficient zero-shot synthetic dataset generation for Information Retrieval. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1214–1231, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Exploring efficient zero-shot synthetic dataset generation for Information Retrieval (Almeida & Matos, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.81.pdf