Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Zeli Su; Ziyin Zhang; Guixian Xu; Jianing Liu; Xu Han (韩旭); Ting Zhang; Yushuang Dong

doi:10.18653/v1/2025.acl-long.893

Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, Xu Han, Ting Zhang, Yushuang Dong

Abstract

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.

Anthology ID:: 2025.acl-long.893
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 18259–18270
Language:
URL:: https://aclanthology.org/2025.acl-long.893/
DOI:: 10.18653/v1/2025.acl-long.893
Bibkey:
Cite (ACL):: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, Xu Han, Ting Zhang, and Yushuang Dong. 2025. Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18259–18270, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages (Su et al., ACL 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.acl-long.893.pdf

PDF Cite Search Fix data