Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages

Yash Bhaskar, Ketaki Shetye, Vandan Mujadia, Dipti Misra Sharma, Parameswari Krishnamurthy


Abstract
This study addresses data scarcity in machine translation for Indian languages, which combine morphological complexity with limited parallel data. We investigate a strategy for maximizing the utility of existing data: generating negative samples from positive training instances through progressive perturbation. These samples are used to align the model with preference data via Kahneman-Tversky Optimization (KTO). Comparing this approach against traditional Supervised Fine-Tuning (SFT), we show that generating negative samples and leveraging KTO improves data efficiency. We create rejected samples by progressively perturbing translations from the available dataset and fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions involving English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, with average gains of 1.84 to 2.47 BLEU and 2.85 to 4.01 chrF over SFT for selected languages, while using the same positive training samples under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low-resource scenarios.
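As a rough illustration of the idea in the abstract, the sketch below derives progressively more corrupted negatives from one reference translation and packs them into unpaired preference records. The abstract does not specify the perturbation operations, so the token drop and adjacent swap used here are assumptions, as are the helper names `perturb` and `build_kto_records`, the prompt wording, and the `prompt`/`completion`/`label` record layout (chosen to resemble the unpaired format commonly used for KTO training).

```python
import random


def perturb(tokens, level, rng):
    """Apply `level` random edits to a token list (hypothetical operations).

    Each edit either drops a token or swaps two adjacent tokens, so a higher
    level generally yields a more heavily corrupted translation.
    """
    out = list(tokens)
    for _ in range(level):
        if len(out) < 2:
            break  # nothing left to drop or swap safely
        i = rng.randrange(len(out) - 1)
        if rng.random() < 0.5:
            del out[i]  # drop one token
        else:
            out[i], out[i + 1] = out[i + 1], out[i]  # swap neighbours
    return out


def build_kto_records(source, reference, max_level=3, seed=0):
    """Return one desirable record plus up to `max_level` undesirable ones."""
    rng = random.Random(seed)
    # Assumed prompt template; the paper's actual prompt is not given here.
    prompt = f"Translate the following sentence: {source}"
    records = [{"prompt": prompt, "completion": reference, "label": True}]
    tokens = reference.split()
    for level in range(1, max_level + 1):
        corrupted = " ".join(perturb(tokens, level, rng))
        # Skip edits that happen to reproduce the reference exactly.
        if corrupted != reference:
            records.append(
                {"prompt": prompt, "completion": corrupted, "label": False}
            )
    return records


if __name__ == "__main__":
    # Romanized placeholder sentence pair, for illustration only.
    for rec in build_kto_records("The weather is nice today.",
                                 "aaj mausam accha hai"):
        print(rec["label"], "|", rec["completion"])
```

Because every negative is derived from an existing positive, the number of positive training samples stays the same as in an SFT setup; only the undesirable examples are new.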
Anthology ID:
2025.mtsummit-1.26
Volume:
Proceedings of Machine Translation Summit XX: Volume 1
Month:
June
Year:
2025
Address:
Geneva, Switzerland
Editors:
Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc
Venue:
MTSummit
Publisher:
European Association for Machine Translation
Pages:
344–352
URL:
https://aclanthology.org/2025.mtsummit-1.26/
Cite (ACL):
Yash Bhaskar, Ketaki Shetye, Vandan Mujadia, Dipti Misra Sharma, and Parameswari Krishnamurthy. 2025. Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages. In Proceedings of Machine Translation Summit XX: Volume 1, pages 344–352, Geneva, Switzerland. European Association for Machine Translation.
Cite (Informal):
Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages (Bhaskar et al., MTSummit 2025)
PDF:
https://aclanthology.org/2025.mtsummit-1.26.pdf