Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval

João Coelho, Bruno Martins, Joao Magalhaes, Jamie Callan, Chenyan Xiong


Abstract
This study investigates the existence of positional biases in Transformer-based language models for text representation learning, particularly in the context of web document retrieval. We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models, extending it to the domain of embedding learning. We examine positional biases at multiple stages of the training pipeline for an encoder-decoder neural retrieval model, namely language model pre-training, contrastive pre-training, and contrastive fine-tuning. Experiments with the MS-MARCO document collection reveal that after contrastive pre-training the model already generates embeddings that better capture the beginning of the input content, with fine-tuning further aggravating this effect.
Anthology ID:
2024.acl-short.35
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
370–377
Language:
URL:
https://aclanthology.org/2024.acl-short.35/
DOI:
10.18653/v1/2024.acl-short.35
Bibkey:
Cite (ACL):
João Coelho, Bruno Martins, Joao Magalhaes, Jamie Callan, and Chenyan Xiong. 2024. Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 370–377, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval (Coelho et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-short.35.pdf