Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur
Kaiwen Lu | Yating Yang | Fengyi Yang | Rui Dong | Bo Ma | Aihetamujiang Aihemaiti | Abibilla Atawulla | Lei Wang | Xi Zhou
Proceedings of the 31st International Conference on Computational Linguistics
Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation remains underexplored, especially for languages that translation models have never been trained on. In this study, we provide a detailed demonstration of how to efficiently extend a large language model to a low-resource language and significantly enhance its translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, continuing pre-training on the extensive monolingual data, fine-tuning on a smaller parallel corpus with translation supervision, and finally applying our proposed direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments show that our strategy effectively extends the set of low-resource languages supported by large language models and significantly improves the model's Uyghur translation quality with limited parallel data. Our study offers detailed insights for extending large language models to other low-resource languages.
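The abstract does not spell out the DPOSE objective, but it builds on direct preference optimization, whose standard loss is well documented. Below is a minimal sketch in PyTorch, assuming the usual DPO formulation over (preferred, dispreferred) translation pairs; the function names, the external quality scorer, and the best-vs-worst pairing heuristic for "self-evolution" are illustrative assumptions, not the authors' exact construction.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of translation preference pairs.

    Each tensor holds summed token log-probabilities of the chosen
    (preferred) or rejected (dispreferred) translation under the
    trainable policy or the frozen reference model.
    """
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen translation by a larger
    # margin than the reference model does.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

def build_preference_pair(candidates: list[str], score) -> tuple[str, str]:
    """Hypothetical self-evolution step: the model samples several
    candidate translations of a source sentence, a quality scorer ranks
    them, and the best and worst become the (chosen, rejected) pair."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]
```

In this reading, "self-evolution" means the preference data comes from the model's own sampled translations rather than from human annotation; how the paper actually scores and pairs candidates is detailed in the full text.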