Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin


Abstract
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection, which reduces the training set size by 2.35×, surprisingly increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on “false negatives”, where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to *identify* and *relabel* false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both the E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 points on BEIR and by 1.7-1.8 nDCG@10 points on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs in identifying false negatives is supported by human annotation results. Our training dataset and code are publicly available.
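
To make the relabeling step concrete, the sketch below shows one way an LLM judge could flag false negatives among mined hard negatives and promote them to positives. This is a minimal illustration, not the authors' released pipeline: the prompt wording, the "gpt-4o-mini" judge model, the `llm_judges_relevant` and `relabel_false_negatives` helpers, and the `{"query", "positives", "hard_negatives"}` data layout are all assumptions made for this example.

```python
# Minimal sketch of LLM-based false-negative relabeling (NOT the paper's code).
# Assumptions: an OpenAI-compatible chat API, "gpt-4o-mini" as the judge model,
# and training examples shaped as {"query": str, "positives": [str], "hard_negatives": [str]}.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_judges_relevant(query: str, passage: str) -> bool:
    """Ask the LLM whether a mined hard negative actually answers the query."""
    prompt = (
        "Judge whether the passage is relevant to the query. "
        "Answer with a single word: yes or no.\n\n"
        f"Query: {query}\nPassage: {passage}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; the paper may use a different LLM
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")


def relabel_false_negatives(example: dict) -> dict:
    """Promote LLM-flagged false negatives to positives; keep the rest as hard negatives."""
    kept, promoted = [], []
    for passage in example["hard_negatives"]:
        (promoted if llm_judges_relevant(example["query"], passage) else kept).append(passage)
    return {
        "query": example["query"],
        "positives": example["positives"] + promoted,
        "hard_negatives": kept,
    }
```

The relabeled triples can then be used for standard contrastive fine-tuning of retrievers such as E5 (base) or Qwen2.5-7B, or of rerankers such as Qwen2.5-3B, as described in the abstract.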
Anthology ID:
2025.findings-emnlp.481
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9064–9083
URL:
https://aclanthology.org/2025.findings-emnlp.481/
Cite (ACL):
Nandan Thakur, Crystina Zhang, Xueguang Ma, and Jimmy Lin. 2025. Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9064–9083, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs (Thakur et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.481.pdf
Checklist:
 2025.findings-emnlp.481.checklist.pdf