Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur

Kaiwen Lu, Yating Yang, Fengyi Yang, Rui Dong, Bo Ma, Aihetamujiang Aihemaiti, Abibilla Atawulla, Lei Wang, Xi Zhou


Abstract
Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation has not yet been fully explored, especially for languages on which translation models have not been trained. In this study, we demonstrate in detail how to efficiently extend a large language model to a low-resource language and significantly enhance its translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, continued pre-training on extensive monolingual data, supervised fine-tuning for translation on a smaller parallel corpus, and, building on these, a proposed direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments show that our strategy effectively expands the set of low-resource languages supported by large language models and significantly enhances the model’s translation ability in Uyghur with limited parallel data. Our research provides detailed insights for extending large language models to other low-resource languages.
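The final stage builds on direct preference optimization (DPO). The abstract does not specify how DPOSE constructs its preference pairs or whether it modifies the objective, so the following is only a minimal sketch of the standard per-example DPO loss, not the authors' method. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of a preferred and a dispreferred translation under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full translation.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # Log-ratio of policy to reference for each translation.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both translations, the loss equals log 2; it falls below that value as the policy assigns relatively more probability to the preferred translation than the reference does.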
Anthology ID:
2025.coling-main.559
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
8360–8373
URL:
https://aclanthology.org/2025.coling-main.559/
Cite (ACL):
Kaiwen Lu, Yating Yang, Fengyi Yang, Rui Dong, Bo Ma, Aihetamujiang Aihemaiti, Abibilla Atawulla, Lei Wang, and Xi Zhou. 2025. Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8360–8373, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur (Lu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.559.pdf