Alleviating Performance Degradation Caused by Out-of-Distribution Issues in Embedding-Based Retrieval

Haotong Bao; Jianjin Zhang; Qi Chen; Weihao Han; Zhengxin Zeng; Ruiheng Chang; Mingzheng Li; Hao Sun; Weiwei Deng; Feng Sun; Qi Zhang

doi:10.18653/v1/2025.findings-emnlp.340

Abstract

In Embedding-Based Retrieval (EBR), Approximate Nearest Neighbor (ANN) algorithms are widely adopted for efficient large-scale search. However, recent studies reveal a query out-of-distribution (OOD) issue, in which query and base embeddings follow mismatched distributions, significantly degrading ANN performance. In this work, we empirically verify the generality of this phenomenon and provide a quantitative analysis. To mitigate the distributional gap, we introduce a distribution regularizer into the encoder training objective, encouraging alignment between query and base embeddings. Extensive experiments across multiple datasets, encoders, and ANN indices show that our method consistently improves retrieval performance.
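
The abstract does not state the form of the distribution regularizer, so the following is only an illustrative sketch of the general idea of adding an alignment penalty to a training loss. It assumes a simple mean-matching penalty (squared distance between the average query embedding and the average base embedding), which is a stand-in and not the authors' method. The names `mean_gap_penalty`, `regularized_loss`, and the `weight` parameter are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(embeddings: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty batch of equal-length embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def mean_gap_penalty(queries: Sequence[Vector], bases: Sequence[Vector]) -> float:
    """Squared Euclidean distance between the query mean and the base mean.

    Assumption: a first-moment penalty chosen for illustration only; the
    paper's actual regularizer is not described in the abstract.
    """
    mq = mean_vector(queries)
    mb = mean_vector(bases)
    return sum((a - b) ** 2 for a, b in zip(mq, mb))


def regularized_loss(
    retrieval_loss: float,
    queries: Sequence[Vector],
    bases: Sequence[Vector],
    weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Retrieval loss plus a weighted distribution-alignment penalty."""
    return retrieval_loss + weight * mean_gap_penalty(queries, bases)


if __name__ == "__main__":
    q = [[1.0, 0.0], [3.0, 0.0]]  # toy query embeddings
    b = [[0.0, 0.0], [0.0, 2.0]]  # toy base embeddings
    print(mean_gap_penalty(q, b))
    print(regularized_loss(1.0, q, b, weight=0.5))
```

In this sketch the penalty is zero when the two batches share the same mean and grows as the means drift apart, so minimizing the combined loss pushes the encoder toward overlapping query and base embeddings. A real training setup would compute the same quantity on differentiable tensors inside the encoder's training loop.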

Anthology ID: 2025.findings-emnlp.340
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6418–6427
URL: https://aclanthology.org/2025.findings-emnlp.340/
DOI: 10.18653/v1/2025.findings-emnlp.340
Cite (ACL): Haotong Bao, Jianjin Zhang, Qi Chen, Weihao Han, Zhengxin Zeng, Ruiheng Chang, Mingzheng Li, Hao Sun, Weiwei Deng, Feng Sun, and Qi Zhang. 2025. Alleviating Performance Degradation Caused by Out-of-Distribution Issues in Embedding-Based Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6418–6427, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Alleviating Performance Degradation Caused by Out-of-Distribution Issues in Embedding-Based Retrieval (Bao et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.340.pdf
Checklist: 2025.findings-emnlp.340.checklist.pdf
