Kaiwen Lu


2025

Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur
Kaiwen Lu | Yating Yang | Fengyi Yang | Rui Dong | Bo Ma | Aihetamujiang Aihemaiti | Abibilla Atawulla | Lei Wang | Xi Zhou
Proceedings of the 31st International Conference on Computational Linguistics

Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation remains underexplored, especially for languages that translation models have never been trained on. In this study, we provide a detailed demonstration of how to efficiently add a low-resource language to a large language model and substantially enhance its translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, continual pre-training on extensive monolingual data, supervised fine-tuning for translation on a smaller parallel corpus, and, building on this, direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments show that our strategy effectively expands the set of low-resource languages supported by large language models and significantly enhances the model's Uyghur translation ability with limited parallel data. Our study offers detailed insights for expanding large language models to other low-resource languages.
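
The abstract only names the DPOSE stage, so as a rough, non-authoritative sketch, the following minimal PyTorch example shows what a DPO step over self-generated translation pairs could look like. The dpo_loss function is the standard DPO objective; build_preference_pairs and its score_fn argument are hypothetical names illustrating one plausible reading of "translation self-evolution" (sample several candidate translations per source sentence, score them, keep the best and worst as a preference pair), not the authors' actual implementation.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Standard DPO objective: prefer the chosen translation over
        the rejected one, regularized toward a frozen reference model."""
        # Implicit reward: log-ratio of policy vs. reference per candidate.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin between chosen and rejected rewards.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    def build_preference_pairs(sampled_translations, score_fn):
        """Hypothetical self-evolution step: rank the model's own candidate
        translations with score_fn (e.g., chrF against a reference) and
        keep the best/worst as a (chosen, rejected) preference pair."""
        pairs = []
        for source, candidates in sampled_translations:
            ranked = sorted(candidates, key=score_fn, reverse=True)
            if len(ranked) >= 2 and score_fn(ranked[0]) > score_fn(ranked[-1]):
                pairs.append({"prompt": source,
                              "chosen": ranked[0],
                              "rejected": ranked[-1]})
        return pairs

Under this reading, the preference data come from the model's own sampled outputs rather than human annotation, which is consistent with the abstract's claim that strong Uyghur translation is reached with limited parallel data.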