Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation

Samin Mahdizadeh Sani, Pouya Sadeghi, Thuy-Trang Vu, Yadollah Yaghoobzadeh, Gholamreza Haffari


Abstract
Large language models (LLMs) have made significant progress in classification and text generation tasks. However, they are mainly trained on English data and often struggle with low-resource languages. In this study, we explore adding a new language, namely Persian, to Llama (a model with a limited understanding of Persian) using parameter-efficient fine-tuning. We employ a multi-stage approach involving pretraining on monolingual Persian data, aligning representations through bilingual pretraining and instruction datasets, and instruction-tuning with task-specific datasets. We evaluate the model’s performance at each stage on generation and classification tasks. Our findings suggest that incorporating the Persian language through bilingual data alignment can enhance classification accuracy on Persian tasks, with no adverse impact on English tasks and, in some cases, even improvements. Additionally, the results highlight the model’s initial strength as a critical factor when working with limited training data, with cross-lingual alignment offering minimal benefits for the low-resource language. Knowledge transfer from English to Persian has a marginal effect, primarily benefiting simple classification tasks.
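The multi-stage adaptation described in the abstract (monolingual Persian pretraining, bilingual alignment, then task-specific instruction tuning) is the kind of setup commonly implemented with low-rank adapters. The following is a minimal, hypothetical sketch using the Hugging Face transformers and peft libraries; the base checkpoint, LoRA hyperparameters, and target modules are illustrative assumptions, not the paper's reported configuration.

    # Hypothetical LoRA-based parameter-efficient adaptation of Llama to Persian.
    # Model name and hyperparameters are assumptions for illustration only.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Each training stage can reuse the same adapter setup; only the data
    # changes (monolingual Persian -> bilingual pairs -> task instructions).
    lora = LoraConfig(
        r=16,                                 # adapter rank (assumed)
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only a small fraction is trainable

Because only the low-rank adapter weights are updated, each stage can be trained cheaply while the frozen base weights help preserve the model's English capabilities across stages.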
Anthology ID:
2025.coling-main.594
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
8868–8884
URL:
https://aclanthology.org/2025.coling-main.594/
Cite (ACL):
Samin Mahdizadeh Sani, Pouya Sadeghi, Thuy-Trang Vu, Yadollah Yaghoobzadeh, and Gholamreza Haffari. 2025. Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8868–8884, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation (Mahdizadeh Sani et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.594.pdf