Rethinking Vocabulary Augmentation: Addressing the Challenges of Low-Resource Languages in Multilingual Models

Nankai Lin; Peijian Zeng; Weixiong Zheng; Shengyi Jiang; Dong Zhou; Aimin Yang

Abstract

Multilingual language models (MLLMs) perform notably worse on low-resource languages (LRLs) than on high-resource ones, primarily because of the limited corpus available during the pre-training phase. This inadequacy stems from the under-representation of low-resource language words in the subword vocabularies of MLLMs, which causes such words to be misidentified as unknown tokens or as incorrectly concatenated subwords. Previous approaches select words for vocabulary augmentation by frequency sorting. However, these methods overlook the fundamental disparities between model representation distributions and frequency distributions. To address this gap, we introduce a novel Entropy-Consistency Word Selection (ECWS) method, which integrates semantic and frequency metrics for vocabulary augmentation. Our results indicate an improvement in performance, supporting our approach as a viable means of enriching vocabularies that are inadequately represented in current MLLMs.

Anthology ID: 2025.coling-main.197
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 2919–2934
URL: https://aclanthology.org/2025.coling-main.197/
Cite (ACL): Nankai Lin, Peijian Zeng, Weixiong Zheng, Shengyi Jiang, Dong Zhou, and Aimin Yang. 2025. Rethinking Vocabulary Augmentation: Addressing the Challenges of Low-Resource Languages in Multilingual Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2919–2934, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Rethinking Vocabulary Augmentation: Addressing the Challenges of Low-Resource Languages in Multilingual Models (Lin et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.197.pdf