2025
Rethinking Vocabulary Augmentation: Addressing the Challenges of Low-Resource Languages in Multilingual Models
Nankai Lin | Peijian Zeng | Weixiong Zheng | Shengyi Jiang | Dong Zhou | Aimin Yang
Proceedings of the 31st International Conference on Computational Linguistics
The performance of multilingual language models (MLLMs) is notably inferior for low-resource languages (LRLs) compared to high-resource ones, primarily due to the limited corpus available during the pre-training phase. This inadequacy stems from the under-representation of low-resource language words in the subword vocabularies of MLLMs, leading to their misidentification as unknown tokens or their over-segmentation into incorrectly concatenated subwords. Previous approaches rely on frequency sorting to select words for vocabulary augmentation. However, these methods overlook the fundamental disparities between model representation distributions and frequency distributions. To address this gap, we introduce a novel Entropy-Consistency Word Selection (ECWS) method, which integrates semantic and frequency metrics for vocabulary augmentation. Our results indicate an improvement in performance, supporting our approach as a viable means to enrich vocabularies that inadequately represent low-resource languages in current MLLMs.
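The abstract does not spell out how ECWS combines the two signals, so the following is only an illustrative sketch of the general idea under stated assumptions: segmentation entropy (how fragmented a word's subword decomposition is) serves as a stand-in for the semantic/representation metric, and it is fused with corpus frequency via rank aggregation. The tokenizer, vocabulary, and `ecws_select` helper are all hypothetical, not the paper's actual method.

```python
import math
from collections import Counter

def subword_entropy(word, vocab):
    """Entropy of a word's greedy longest-match subword segmentation.

    High entropy ~ the word fragments into many distinct pieces,
    i.e. it is poorly represented by the current subword vocabulary.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # fall back to a single character when no vocab entry matches
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    counts = Counter(pieces)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def ecws_select(corpus_words, vocab, k):
    """Pick k out-of-vocabulary words on which frequency and entropy
    rankings are most consistent (lowest combined rank)."""
    freq = Counter(corpus_words)
    oov = [w for w in freq if w not in vocab]
    f_rank = {w: r for r, w in enumerate(sorted(oov, key=lambda w: -freq[w]))}
    e_rank = {w: r for r, w in enumerate(
        sorted(oov, key=lambda w: -subword_entropy(w, vocab)))}
    return sorted(oov, key=lambda w: f_rank[w] + e_rank[w])[:k]
```

For example, with `vocab = {"un", "break", "able", "ing", "the"}`, the word "unbreakable" segments into three pieces (`un / break / able`) while a fully out-of-vocabulary word shatters into single characters and scores much higher entropy, so a frequent but badly segmented word rises to the top of the selection. A pure frequency sort would ignore the second signal entirely, which is the gap the abstract points at.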