Language Models are Universal Embedders

Xin Zhang; Zehan Li; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Meishan Zhang; Min Zhang

doi:10.18653/v1/2025.xllm-1.21

Abstract

In the large language model (LLM) revolution, embedding is a key component of various systems, such as those that retrieve knowledge or memories for LLMs or that power content moderation filters. Because such use cases span from English to other natural or programming languages, and from retrieval to classification and beyond, it is advantageous to build a unified embedding model rather than a dedicated one for each scenario. In this context, pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model is proficient at generating good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also present detailed analyses and additional evaluations. We hope that this work will encourage the development of more robust open-source universal embedders.

Anthology ID: 2025.xllm-1.21
Volume: Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Month: August
Year: 2025
Address: Vienna, Austria
Editors: Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang
Venues: XLLM | WS
Publisher: Association for Computational Linguistics
Pages: 252–265
URL: https://aclanthology.org/2025.xllm-1.21/
DOI: 10.18653/v1/2025.xllm-1.21
Cite (ACL): Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. 2025. Language Models are Universal Embedders. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 252–265, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Language Models are Universal Embedders (Zhang et al., XLLM 2025)
PDF: https://aclanthology.org/2025.xllm-1.21.pdf
