@inproceedings{hwang-etal-2023-embedtextnet,
title = "{E}mbed{T}ext{N}et: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding",
author = "Hwang, Dae Yon and
Taha, Bilal and
Nechaev, Yaroslav",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.625",
doi = "10.18653/v1/2023.findings-acl.625",
pages = "9863--9879",
abstract = "The size of embeddings generated by large language models can negatively affect system latency and model size in certain downstream practical applications (e.g. KNN search). In this work, we propose EmbedTextNet, a light add-on network that can be appended to an arbitrary language model to generate a compact embedding without requiring any changes in its architecture or training procedure. Specifically, we use a correlation penalty added to the weighted reconstruction loss that better captures the informative features in the text embeddings, which improves the efficiency of the language models. We evaluated EmbedTextNet on three different downstream tasks: text similarity, language modelling, and text retrieval. Empirical results on diverse benchmark datasets demonstrate the effectiveness and superiority of EmbedTextNet compared to state-of-art methodologies in recent works, especially in extremely low dimensional embedding sizes. The developed code for reproducibility is included in the supplementary material.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hwang-etal-2023-embedtextnet">
<titleInfo>
<title>EmbedTextNet: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding</title>
</titleInfo>
<name type="personal">
<namePart type="given">Dae</namePart>
<namePart type="given">Yon</namePart>
<namePart type="family">Hwang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bilal</namePart>
<namePart type="family">Taha</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yaroslav</namePart>
<namePart type="family">Nechaev</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2023</title>
</titleInfo>
<name type="personal">
<namePart type="given">Anna</namePart>
<namePart type="family">Rogers</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jordan</namePart>
<namePart type="family">Boyd-Graber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Naoaki</namePart>
<namePart type="family">Okazaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Toronto, Canada</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>The size of embeddings generated by large language models can negatively affect system latency and model size in certain downstream practical applications (e.g. KNN search). In this work, we propose EmbedTextNet, a light add-on network that can be appended to an arbitrary language model to generate a compact embedding without requiring any changes in its architecture or training procedure. Specifically, we use a correlation penalty added to the weighted reconstruction loss that better captures the informative features in the text embeddings, which improves the efficiency of the language models. We evaluated EmbedTextNet on three different downstream tasks: text similarity, language modelling, and text retrieval. Empirical results on diverse benchmark datasets demonstrate the effectiveness and superiority of EmbedTextNet compared to state-of-art methodologies in recent works, especially in extremely low dimensional embedding sizes. The developed code for reproducibility is included in the supplementary material.</abstract>
<identifier type="citekey">hwang-etal-2023-embedtextnet</identifier>
<identifier type="doi">10.18653/v1/2023.findings-acl.625</identifier>
<location>
<url>https://aclanthology.org/2023.findings-acl.625</url>
</location>
<part>
<date>2023-07</date>
<extent unit="page">
<start>9863</start>
<end>9879</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T EmbedTextNet: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding
%A Hwang, Dae Yon
%A Taha, Bilal
%A Nechaev, Yaroslav
%Y Rogers, Anna
%Y Boyd-Graber, Jordan
%Y Okazaki, Naoaki
%S Findings of the Association for Computational Linguistics: ACL 2023
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F hwang-etal-2023-embedtextnet
%X The size of embeddings generated by large language models can negatively affect system latency and model size in certain downstream practical applications (e.g. KNN search). In this work, we propose EmbedTextNet, a light add-on network that can be appended to an arbitrary language model to generate a compact embedding without requiring any changes in its architecture or training procedure. Specifically, we use a correlation penalty added to the weighted reconstruction loss that better captures the informative features in the text embeddings, which improves the efficiency of the language models. We evaluated EmbedTextNet on three different downstream tasks: text similarity, language modelling, and text retrieval. Empirical results on diverse benchmark datasets demonstrate the effectiveness and superiority of EmbedTextNet compared to state-of-art methodologies in recent works, especially in extremely low dimensional embedding sizes. The developed code for reproducibility is included in the supplementary material.
%R 10.18653/v1/2023.findings-acl.625
%U https://aclanthology.org/2023.findings-acl.625
%U https://doi.org/10.18653/v1/2023.findings-acl.625
%P 9863-9879
Markdown (Informal)
[EmbedTextNet: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding](https://aclanthology.org/2023.findings-acl.625) (Hwang et al., Findings 2023)
ACL
Dae Yon Hwang, Bilal Taha, and Yaroslav Nechaev. 2023. EmbedTextNet: Dimension Reduction with Weighted Reconstruction and Correlation Losses for Efficient Text Embedding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9863–9879, Toronto, Canada. Association for Computational Linguistics.
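
The abstract above describes compressing text embeddings with a light add-on network trained with a weighted reconstruction loss plus a correlation penalty. The following is a minimal, non-authoritative PyTorch sketch of what such a combined objective could look like; the function name, the per-dimension weighting, the exact form of the penalty, and the lambda value are all assumptions for illustration, not the paper's formulation.

import torch


def compressed_embedding_loss(x, x_hat, z, recon_weight=None, corr_lambda=0.1):
    """Illustrative objective: weighted reconstruction + correlation penalty.

    x            original high-dimensional text embeddings, shape (batch, d_in)
    x_hat        reconstruction of x from the add-on network, shape (batch, d_in)
    z            compact bottleneck embeddings, shape (batch, d_out)
    recon_weight optional per-dimension weights for the reconstruction term (assumed)
    corr_lambda  strength of the correlation penalty (illustrative value)
    """
    # Weighted reconstruction: per-dimension squared error, optionally re-weighted
    # so that more informative embedding dimensions contribute more to the loss.
    sq_err = (x - x_hat) ** 2
    if recon_weight is not None:
        sq_err = sq_err * recon_weight
    recon_loss = sq_err.mean()

    # Correlation penalty: encourage decorrelated bottleneck dimensions by
    # penalizing off-diagonal entries of the correlation matrix of z.
    z_centered = z - z.mean(dim=0, keepdim=True)
    cov = (z_centered.T @ z_centered) / (z.shape[0] - 1)
    std = z.std(dim=0, unbiased=True) + 1e-8
    corr = cov / torch.outer(std, std)
    off_diag = corr - torch.diag(torch.diag(corr))
    corr_penalty = (off_diag ** 2).sum()

    return recon_loss + corr_lambda * corr_penalty


# Toy usage with random tensors standing in for real encoder outputs:
x = torch.randn(32, 768)      # e.g. sentence-encoder embeddings
z = torch.randn(32, 64)       # compact representation from the add-on network
x_hat = torch.randn(32, 768)  # its reconstruction of x
loss = compressed_embedding_loss(x, x_hat, z)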