Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval

Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, Xing Xie


Abstract
Pre-trained language models (PLMs) have achieved a preeminent position in dense retrieval due to their powerful capacity for modeling intrinsic semantics. However, most existing PLM-based retrieval models incur substantial computational costs and are infeasible for processing long documents. In this paper, a novel retrieval model, Longtriever, is proposed to address three core challenges of long document retrieval: substantial computational cost, incomplete document understanding, and scarce annotations. Longtriever splits long documents into short blocks and then efficiently models the local semantics within a block and the global context semantics across blocks in a tightly coupled manner. A pre-training phase is further proposed to empower Longtriever with a better understanding of the underlying semantic correlations. Experimental results on two popular benchmark datasets demonstrate the superiority of our proposal.
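The following is a minimal sketch, not the authors' implementation, of the block-wise encoding idea the abstract describes: a long document is split into short blocks, local semantics are modeled within each block, and global context is exchanged across blocks before pooling a document embedding. The block length, hidden size, pooling scheme, and two-stage layout are illustrative assumptions; the actual Longtriever couples local and global attention more tightly inside its layers.

# Simplified block-wise document encoder (illustrative sketch only).
import torch
import torch.nn as nn

class BlockwiseDocumentEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, block_len=128, n_heads=4):
        super().__init__()
        self.block_len = block_len
        self.embed = nn.Embedding(vocab_size, hidden)
        # Local encoder: attention restricted to tokens within one block.
        self.local = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        # Global encoder: attention across per-block summary vectors.
        self.global_enc = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (doc_len,) token ids of one long, already tokenized document.
        pad = (-token_ids.numel()) % self.block_len
        token_ids = torch.cat([token_ids, token_ids.new_zeros(pad)])
        blocks = token_ids.view(-1, self.block_len)                  # (n_blocks, block_len)
        local_states = self.local(self.embed(blocks))                # within-block contextualization
        block_summaries = local_states.mean(dim=1)                   # (n_blocks, hidden)
        global_states = self.global_enc(block_summaries.unsqueeze(0))  # cross-block contextualization
        return global_states.mean(dim=1).squeeze(0)                  # document embedding (hidden,)

doc_embedding = BlockwiseDocumentEncoder()(torch.randint(1, 30000, (1000,)))
print(doc_embedding.shape)  # torch.Size([256])

In a dense-retrieval setup, a document embedding produced this way would be scored against a query embedding (e.g. by dot product) and the encoder trained with contrastive objectives; the pre-training phase mentioned in the abstract is aimed at improving exactly this document-level representation.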
Anthology ID:
2023.emnlp-main.223
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3655–3665
URL:
https://aclanthology.org/2023.emnlp-main.223
DOI:
10.18653/v1/2023.emnlp-main.223
Cite (ACL):
Junhan Yang, Zheng Liu, Chaozhuo Li, Guangzhong Sun, and Xing Xie. 2023. Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3655–3665, Singapore. Association for Computational Linguistics.
Cite (Informal):
Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval (Yang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.223.pdf
Video:
https://aclanthology.org/2023.emnlp-main.223.mp4