Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Sonal Sannigrahi, Josef van Genabith, Cristina España-Bonet


Abstract
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and a few other high-resourced languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare input token number truncation, sentence averaging as well as some simple windowing and in some cases new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
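
As a rough illustration of two of the combination strategies named in the abstract (sentence averaging and simple windowing), the following is a minimal sketch and not the authors' implementation. It assumes the sentence embeddings have already been computed by some encoder and are given as a NumPy matrix; the function names, the window and stride parameters, and the choice to average window vectors are all hypothetical.

```python
import numpy as np


def average_sentences(sent_embs: np.ndarray) -> np.ndarray:
    """Mean-pool a (n_sentences, dim) matrix into a single document vector."""
    return sent_embs.mean(axis=0)


def windowed_average(sent_embs: np.ndarray, window: int = 3, stride: int = 3) -> np.ndarray:
    """Average sentence vectors inside each window, then average the window vectors.

    Assumption: this is only one plausible reading of "windowing"; the paper
    may define its windows differently.
    """
    n_sentences = sent_embs.shape[0]
    window_vectors = [
        sent_embs[start:start + window].mean(axis=0)
        for start in range(0, n_sentences, stride)
    ]
    return np.mean(window_vectors, axis=0)


if __name__ == "__main__":
    # Stand-in for real sentence embeddings: 5 sentences, 4 dimensions.
    rng = np.random.default_rng(0)
    fake_sentence_embeddings = rng.normal(size=(5, 4))
    print(average_sentences(fake_sentence_embeddings).shape)
    print(windowed_average(fake_sentence_embeddings, window=2, stride=2).shape)
```

With non-overlapping windows of equal size the two functions return the same vector; they differ when the last window is shorter or when the windows overlap, because sentences are then weighted unevenly.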
Anthology ID:
2023.findings-eacl.174
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2306–2316
URL:
https://aclanthology.org/2023.findings-eacl.174
DOI:
10.18653/v1/2023.findings-eacl.174
Cite (ACL):
Sonal Sannigrahi, Josef van Genabith, and Cristina España-Bonet. 2023. Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? In Findings of the Association for Computational Linguistics: EACL 2023, pages 2306–2316, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? (Sannigrahi et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.174.pdf
Video:
https://aclanthology.org/2023.findings-eacl.174.mp4