@inproceedings{mahajan-etal-2025-revisiting,
title = "Revisiting Word Embeddings in the {LLM} Era",
author = "Mahajan, Yash and
Freestone, Matthew and
Bansal, Naman and
Aakur, Sathyanarayanan N. and
Karmaker, Santu",
editor = "Inui, Kentaro and
Sakti, Sakriani and
Wang, Haofen and
Wong, Derek F. and
Bhattacharyya, Pushpak and
Banerjee, Biplab and
Ekbal, Asif and
Chakraborty, Tanmoy and
Singh, Dhirendra Pratap",
booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
url = "https://aclanthology.org/2025.ijcnlp-long.145/",
pages = "2686--2717",
ISBN = "979-8-89176-298-5",
abstract = "Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether underlying embeddings they produce significantly differ from classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This is the central question we investigate in the paper by systematically comparing classical decontextualized and contextualized word embeddings with the same for LLM-induced embeddings. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics."
}
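For readers who want a quick feel for the kind of comparison the abstract describes (decontextualized word embeddings drawn from a decoder-only LLM, scored by cosine similarity), here is a minimal, hypothetical Python sketch. It is not the authors' code or experimental setup; the model choice ("gpt2" as a small stand-in decoder-only LM), the mean-pooling over subword tokens, and the word pairs are assumptions for illustration only.

```python
# Hypothetical sketch (not the paper's released code): probe decontextualized
# word embeddings from a decoder-only LM and score word-pair similarity.
# Model name, pooling strategy, and word pairs are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

def word_embedding(word: str, tokenizer, model) -> torch.Tensor:
    """Embed a single word by mean-pooling its subword hidden states."""
    inputs = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, n_subwords, d)
    return hidden.mean(dim=1).squeeze(0)            # shape (d,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in decoder-only LM
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Semantically related vs. unrelated pairs; tighter semantic clustering should
# show up as a larger similarity gap between the two kinds of pairs.
pairs = [("king", "queen"), ("king", "banana")]
for a, b in pairs:
    va = word_embedding(a, tokenizer, model)
    vb = word_embedding(b, tokenizer, model)
    cos = torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
    print(f"cos({a}, {b}) = {cos:.3f}")
```

A classical static baseline (e.g., pretrained GloVe vectors) could be scored on the same pairs for a side-by-side comparison in the spirit of the paper's decontextualized evaluation.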