@inproceedings{nozaki-etal-2025-vrcp,
title = "{VRCP}: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models",
author = "Nozaki, Yuta and
Nakashima, Dai and
Sato, Ryo and
Asaba, Naoki",
booktitle = "Proceedings of the Second Workshop on Scaling Up Multilingual {\&} Multi-Cultural Evaluation",
month = jan,
year = "2025",
address = "Abu Dhabi",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.sumeval-2.5/",
pages = "48--59",
abstract = "Building large language models (LLMs) for non-English languages involves leveraging extensively trained English models through continued pre-training on the target language corpora. This approach harnesses the rich semantic knowledge embedded in English models, allowing superior performance compared to training from scratch. However, tokenizers not optimized for the target language may make inefficiencies in training. We propose Vocabulary Replacement Continued Pretraining (VRCP), a method that optimizes the tokenizer for the target language by replacing unique (solely available) vocabulary from the source tokenizer while maintaining the overall vocabulary size. This approach preserves the semantic knowledge of the source model while enhancing token efficiency and performance for the target language. We evaluated VRCP using the Llama-2 model on Japanese and Chinese corpora. The results show that VRCP matches the performance of vocabulary expansion methods on benchmarks and achieves superior performance in summarization tasks. Additionally, VRCP provides an optimized tokenizer that balances token efficiency, task performance, and GPU memory footprint, making it particularly suitable for resource-constrained environments."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="nozaki-etal-2025-vrcp">
<titleInfo>
<title>VRCP: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yuta</namePart>
<namePart type="family">Nozaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dai</namePart>
<namePart type="family">Nakashima</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ryo</namePart>
<namePart type="family">Sato</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Naoki</namePart>
<namePart type="family">Asaba</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-01</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Second Workshop on Scaling Up Multilingual &amp; Multi-Cultural Evaluation</title>
</titleInfo>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Abu Dhabi</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Building large language models (LLMs) for non-English languages involves leveraging extensively trained English models through continued pre-training on the target language corpora. This approach harnesses the rich semantic knowledge embedded in English models, allowing superior performance compared to training from scratch. However, tokenizers not optimized for the target language may introduce inefficiencies in training. We propose Vocabulary Replacement Continued Pretraining (VRCP), a method that optimizes the tokenizer for the target language by replacing vocabulary unique to (i.e., solely available in) the source tokenizer while maintaining the overall vocabulary size. This approach preserves the semantic knowledge of the source model while enhancing token efficiency and performance for the target language. We evaluated VRCP using the Llama-2 model on Japanese and Chinese corpora. The results show that VRCP matches the performance of vocabulary expansion methods on benchmarks and achieves superior performance in summarization tasks. Additionally, VRCP provides an optimized tokenizer that balances token efficiency, task performance, and GPU memory footprint, making it particularly suitable for resource-constrained environments.</abstract>
<identifier type="citekey">nozaki-etal-2025-vrcp</identifier>
<location>
<url>https://aclanthology.org/2025.sumeval-2.5/</url>
</location>
<part>
<date>2025-01</date>
<extent unit="page">
<start>48</start>
<end>59</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T VRCP: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models
%A Nozaki, Yuta
%A Nakashima, Dai
%A Sato, Ryo
%A Asaba, Naoki
%S Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation
%D 2025
%8 January
%I Association for Computational Linguistics
%C Abu Dhabi
%F nozaki-etal-2025-vrcp
%X Building large language models (LLMs) for non-English languages involves leveraging extensively trained English models through continued pre-training on the target language corpora. This approach harnesses the rich semantic knowledge embedded in English models, allowing superior performance compared to training from scratch. However, tokenizers not optimized for the target language may introduce inefficiencies in training. We propose Vocabulary Replacement Continued Pretraining (VRCP), a method that optimizes the tokenizer for the target language by replacing vocabulary unique to (i.e., solely available in) the source tokenizer while maintaining the overall vocabulary size. This approach preserves the semantic knowledge of the source model while enhancing token efficiency and performance for the target language. We evaluated VRCP using the Llama-2 model on Japanese and Chinese corpora. The results show that VRCP matches the performance of vocabulary expansion methods on benchmarks and achieves superior performance in summarization tasks. Additionally, VRCP provides an optimized tokenizer that balances token efficiency, task performance, and GPU memory footprint, making it particularly suitable for resource-constrained environments.
%U https://aclanthology.org/2025.sumeval-2.5/
%P 48-59
Markdown (Informal)
[VRCP: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models](https://aclanthology.org/2025.sumeval-2.5/) (Nozaki et al., SUMEval 2025)
ACL
Yuta Nozaki, Dai Nakashima, Ryo Sato, and Naoki Asaba. 2025. VRCP: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models. In Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation, pages 48–59, Abu Dhabi. Association for Computational Linguistics.
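
The abstract describes replacing tokens that exist only in the source tokenizer with target-language tokens while keeping the vocabulary size fixed. The snippet below is a minimal, hypothetical sketch of that idea only; the toy vocabularies, usage counts, and the zero-usage selection rule are assumptions for illustration, not the VRCP procedure or the authors' code.

```python
# Toy sketch: swap source-tokenizer tokens that never fire on the target corpus
# for target-language tokens, reusing their ids so the vocabulary size |V| is
# unchanged. Everything here is illustrative, not the paper's implementation.
from collections import Counter


def replace_vocabulary(source_vocab, target_candidates, usage_counts):
    """Return a same-size vocabulary with unused source tokens replaced."""
    new_vocab = dict(source_vocab)
    # Source tokens with zero usage are treated as replaceable slots.
    replaceable = [t for t in source_vocab if usage_counts[t] == 0]
    # Candidate target-language tokens not already in the vocabulary.
    fresh = [t for t in target_candidates if t not in source_vocab]
    for old_token, new_token in zip(replaceable, fresh):
        token_id = new_vocab.pop(old_token)  # free the old token's id...
        new_vocab[new_token] = token_id      # ...and reuse it for the new token
    return new_vocab


if __name__ == "__main__":
    src = {"hello": 0, "world": 1, "qx##": 2, "zz@@": 3}   # toy source vocabulary
    counts = Counter({"hello": 120, "world": 80})          # "qx##", "zz@@" unused
    print(replace_vocabulary(src, ["日本", "語"], counts))
    # -> {'hello': 0, 'world': 1, '日本': 2, '語': 3}
```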