DocSplit: Simple Contrastive Pretraining for Large Document Embeddings

Yujie Wang, Mike Izbicki


Abstract
Existing model pretraining methods only consider local information. For example, in the popular token masking strategy, the words closer to the masked token are more important for prediction than words farther away. This results in pretrained models that generate high-quality sentence embeddings, but low-quality embeddings for large documents. We propose a new pretraining method called DocSplit which forces models to consider the entire global context of a large document. Our method uses a contrastive loss where the positive examples are randomly sampled sections of the input document, and negative examples are randomly sampled sections of unrelated documents. Like previous pretraining methods, DocSplit is fully unsupervised, easy to implement, and can be used to pretrain any model architecture. Our experiments show that DocSplit outperforms other pretraining methods for document classification, few-shot learning, and information retrieval tasks.
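The abstract's description of the contrastive objective can be sketched in a few lines. The snippet below is a minimal, illustrative Python/PyTorch sketch of the idea as stated above: positive pairs are two randomly sampled sections of the same document, and negatives are sections of other documents (here, the other documents in the batch). The function names, section length, temperature, and in-batch InfoNCE formulation are assumptions for illustration, not the paper's exact recipe; see the PDF for the authors' actual method.

    import random
    import torch
    import torch.nn.functional as F

    def split_into_sections(doc_tokens, section_len=128):
        """Split a tokenized document into contiguous sections."""
        return [doc_tokens[i:i + section_len]
                for i in range(0, len(doc_tokens), section_len)]

    def sample_positive_pair(doc_tokens, section_len=128):
        """Sample two random sections from the same document (a positive pair)."""
        sections = split_into_sections(doc_tokens, section_len)
        return random.choice(sections), random.choice(sections)

    def contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
        """In-batch contrastive (InfoNCE-style) loss: each anchor's positive is the
        matching row; all other rows in the batch act as negatives, i.e. sections
        drawn from unrelated documents."""
        anchor_emb = F.normalize(anchor_emb, dim=-1)
        positive_emb = F.normalize(positive_emb, dim=-1)
        logits = anchor_emb @ positive_emb.T / temperature           # (B, B) similarities
        labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
        return F.cross_entropy(logits, labels)

In training, each pair of sections would be encoded by the model being pretrained, and the loss above would pull embeddings of sections from the same document together while pushing apart sections from different documents.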
Anthology ID:
2023.findings-emnlp.945
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14190–14196
URL:
https://aclanthology.org/2023.findings-emnlp.945
DOI:
10.18653/v1/2023.findings-emnlp.945
Cite (ACL):
Yujie Wang and Mike Izbicki. 2023. DocSplit: Simple Contrastive Pretraining for Large Document Embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14190–14196, Singapore. Association for Computational Linguistics.
Cite (Informal):
DocSplit: Simple Contrastive Pretraining for Large Document Embeddings (Wang & Izbicki, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.945.pdf