Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

Jianqing Zhu; Huang Huang; Zhihang Lin; Juhao Liang; Zhengyang Tang; Khalid Almubarak; Mosen Alharthi; Bang An; Juncai He; Xiangbo Wu; Fei Yu; Junying Chen; Ma Zhuoheng; Yuhao Du; He Zhang; Saied Alshahrani; Emad A. Alghamdi; Lian Zhang; Ruoyu Sun; Haizhou Li; Benyou Wang; Jinchao Xu

doi:10.18653/v1/2025.acl-long.100

Abstract

This paper addresses the critical need to democratize large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or GPT-3.5, owing to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for Arabic LLMs is to use Arabic-specific vocabulary in the tokenizer to accelerate decoding. However, switching to a different vocabulary often degrades the model's learned knowledge, since many words become out-of-vocabulary (OOV) at the beginning of training. Inspired by how humans learn vocabulary during second language (Arabic) acquisition, the released AraLLaMA employs progressive vocabulary expansion, implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. An ablation study demonstrates the effectiveness of progressive vocabulary expansion. Moreover, AraLLaMA achieves performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Our model weights are available at: https://github.com/FreedomIntelligence/AraLLaMa.
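To make the idea of a vocabulary that grows in stages concrete, the following is a minimal toy sketch, not the authors' implementation. It learns an ordered list of BPE merges on a tiny Arabic word list and then activates only a prefix of those merges at each stage, so words are split into fewer tokens as the vocabulary expands. The doubling schedule, the word list, and all function names (`learn_bpe_merges`, `expansion_schedule`, `tokens_per_word`) are assumptions made for illustration only.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of BPE merges from a list of words (toy version)."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single token
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def tokenize(word, merges):
    """Tokenize a word using only the merges that are currently active."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def expansion_schedule(total_merges):
    """Hypothetical schedule: 0 active merges, then 1, 2, 4, ... up to total_merges."""
    sizes, k = [0], 1
    while k < total_merges:
        sizes.append(k)
        k *= 2
    sizes.append(total_merges)
    return sizes


def tokens_per_word(words, merges):
    """Average number of tokens per word under the given active merges."""
    return sum(len(tokenize(w, merges)) for w in words) / len(words)


if __name__ == "__main__":
    # Illustrative word list (assumption), all sharing the root k-t-b.
    words = ["كتاب", "كاتب", "مكتبة", "كتب", "مكتوب"]
    merges = learn_bpe_merges(words, num_merges=8)
    for k in expansion_schedule(len(merges)):
        rate = tokens_per_word(words, merges[:k])
        print(f"active merges: {k:2d}  avg tokens/word: {rate:.2f}")
```

Because each stage applies a prefix of the same merge list, the average number of tokens per word can only stay the same or fall as merges are activated. In a real training setup the model's embedding table would also have to grow with the vocabulary; that part is outside this sketch.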

Anthology ID: 2025.acl-long.100
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2025–2042
URL: https://aclanthology.org/2025.acl-long.100/
DOI: 10.18653/v1/2025.acl-long.100
Cite (ACL): Jianqing Zhu, Huang Huang, Zhihang Lin, Juhao Liang, Zhengyang Tang, Khalid Almubarak, Mosen Alharthi, Bang An, Juncai He, Xiangbo Wu, Fei Yu, Junying Chen, Ma Zhuoheng, Yuhao Du, He Zhang, Saied Alshahrani, Emad A. Alghamdi, Lian Zhang, Ruoyu Sun, Haizhou Li, Benyou Wang, and Jinchao Xu. 2025. Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2025–2042, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion (Zhu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.100.pdf
