SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

Wuttikorn Ponwitayarat; Peerat Limkonchotiwat; Raymond Ng; Jann Railey Montalan; Thura Aung; Jian Gang Ngui; Yosephine Susanto; William Chandra Tjhi; Panuthep Tasawong; Erik Cambria; Ekapol Chuangsuwanich; Sarana Nutanong

Abstract

Multilingual text embeddings are often assumed to encode meaning in a perspective-independent semantic space, yielding stable similarity judgments across tasks and languages. Our results show that this assumption does not hold in practice. We introduce SEA-BED, a large-scale benchmark covering 10 Southeast Asian (SEA) languages and diverse embedding tasks, designed to systematically examine how embedding performance varies across tasks, languages, and language-task combinations. Across extensive evaluations, we observe that no single model performs uniformly well across SEA languages; task difficulty differs markedly within languages, and success on one task does not reliably generalize to others. Language-task analyses further reveal highly non-uniform performance landscapes, with scores varying from one language-task combination to the next. These findings call for performance measurements that span languages and tasks broadly enough to uncover inconsistencies in semantic representation. Based on these observations, we provide insights for future model development, including data, algorithmic, and architectural considerations.

Anthology ID: 2026.acl-long.397
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8788–8822
URL: https://aclanthology.org/2026.acl-long.397/
Cite (ACL): Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Raymond Ng, Jann Railey Montalan, Thura Aung, Jian Gang Ngui, Yosephine Susanto, William Chandra Tjhi, Panuthep Tasawong, Erik Cambria, Ekapol Chuangsuwanich, and Sarana Nutanong. 2026. SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8788–8822, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SEA-BED: How Do Embedding Models Represent Southeast Asian Languages? (Ponwitayarat et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.397.pdf
Checklist: 2026.acl-long.397.checklist.pdf