Length is a Curse and a Blessing for Document-level Semantics

Chenghao Xiao, Yizhi Li, G Hudson, Chenghua Lin, Noura Al Moubayed


Abstract
In recent years, contrastive learning (CL) has been extensively utilized to recover sentence- and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability to length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that we can devise unsupervised CL methods that depend solely on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document intensifies the high intra-document similarity already induced by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of the text exposed during training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, **LA(SER)3**: length-agnostic self-reference for semantically robust sentence representation learning, which achieves state-of-the-art unsupervised performance on the standard information retrieval benchmark. [Our code is publicly available.](https://github.com/gowitheflow-1998/LA-SER-cubed)
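As a toy illustration (not taken from the paper) of why elongation raises intra-document similarity: if we measure intra-document similarity as the mean pairwise cosine similarity among a document's token embeddings, then elongating the document by repeating its tokens adds duplicate pairs with similarity 1, which pulls the mean upward whenever the original mean is below 1. The random embeddings and dimensions below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def intra_doc_similarity(tokens: np.ndarray) -> float:
    """Mean pairwise cosine similarity among a document's token embeddings."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(tokens)
    # average over distinct pairs, excluding the diagonal of self-similarities
    return (sims.sum() - n) / (n * (n - 1))

# toy "document": 8 random token embeddings in 32 dimensions
doc = rng.standard_normal((8, 32))

# "length attack": elongate the document by repeating its tokens
elongated = np.concatenate([doc, doc], axis=0)

print(intra_doc_similarity(doc))
print(intra_doc_similarity(elongated))  # strictly higher: duplicated tokens form pairs with similarity 1
```

A short calculation shows the effect exactly: if the original mean pairwise similarity over n tokens is s, duplication yields a mean of (1 + 2(n-1)s) / (2n-1), which exceeds s for any s < 1. This is only a geometric sketch of the elongation effect, not the paper's CL-based analysis.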
Anthology ID:
2023.emnlp-main.86
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1385–1396
URL:
https://aclanthology.org/2023.emnlp-main.86
DOI:
10.18653/v1/2023.emnlp-main.86
Bibkey:
Cite (ACL):
Chenghao Xiao, Yizhi Li, G Hudson, Chenghua Lin, and Noura Al Moubayed. 2023. Length is a Curse and a Blessing for Document-level Semantics. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1385–1396, Singapore. Association for Computational Linguistics.
Cite (Informal):
Length is a Curse and a Blessing for Document-level Semantics (Xiao et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.86.pdf
Video:
https://aclanthology.org/2023.emnlp-main.86.mp4