Challenge Track: JHARNA-MT: A Copy-Augmented Hybrid of LoRA-Tuned NLLB and Lexical SMT with Minimum Bayes Risk Decoding for Low-Resource Indic Languages

Dao Sy Duy Minh; Trung Kiet Huynh; Tran Chi Nguyen; Phu Quy Nguyen Lam; Phu-Hoa Pham; Nguyễn Đình Hà Dương; Dinh Dien; Long HB Nguyen

Abstract

This paper describes JHARNA-MT, our system for the MMLoSo 2025 Shared Task on translation between high-resource languages (Hindi, English) and four low-resource Indic tribal languages: Bhili, Gondi, Mundari, and Santali. The task poses significant challenges, including data sparsity, morphological richness, and structural divergence across language pairs. To address these, we propose a hybrid translation pipeline that integrates non-parametric retrieval, lexical statistical machine translation (SMT), and LoRA-tuned NLLB-200 neural machine translation under a unified Minimum Bayes Risk (MBR) decoding framework. Exact and fuzzy retrieval exploit redundancy in government and administrative texts, SMT with diagonal alignment priors and back-translation provides lexically faithful hypotheses, and the NLLB-LoRA component contributes fluent neural candidates. MBR decoding selects consensus translations using a metric-matched utility based on a weighted combination of BLEU and chrF, mitigating the complementary error modes of SMT and NMT. Our final system, further enhanced with script-aware digit normalization and entity-preserving post-processing, achieves a private leaderboard score of 186.37 and ranks 2nd overall in the shared task, with ablation studies confirming the contribution of each component.
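The consensus-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the functions `simple_bleu`, `simple_chrf`, `utility`, and `mbr_select` are hypothetical names, the BLEU and chrF scorers are simplified pure-Python stand-ins for the real metrics, and the 0.6/0.4 weighting is an assumed value, since the abstract does not state the weights. Each candidate (for example, one from retrieval, one from SMT, several from NLLB-LoRA) is scored against the other candidates as pseudo-references, and the one with the highest average utility is returned.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Count n-grams of a sequence (a token list or a character string)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def simple_chrf(hyp, ref, max_n=4, beta=2.0):
    """Simplified chrF: mean character n-gram F-beta score over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        if overlap == 0:
            scores.append(0.0)
            continue
        p = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        scores.append((1 + beta ** 2) * p * rec / (beta ** 2 * p + rec))
    return sum(scores) / len(scores) if scores else 0.0


def simple_bleu(hyp, ref, max_n=4):
    """Simplified sentence BLEU with add-one smoothing and a brevity penalty."""
    h_tok, r_tok = hyp.split(), ref.split()
    if not h_tok or not r_tok:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(h_tok, n), ngrams(r_tok, n)
        match = sum((h & r).values())
        total = sum(h.values())
        log_prec += math.log((match + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(r_tok) / len(h_tok)))
    return bp * math.exp(log_prec / max_n)


def utility(hyp, ref, w_bleu=0.6):
    """Weighted BLEU + chrF utility; the 0.6 weight is an assumption."""
    return w_bleu * simple_bleu(hyp, ref) + (1 - w_bleu) * simple_chrf(hyp, ref)


def mbr_select(candidates, w_bleu=0.6):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o, w_bleu) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the village council met today",
    "the village council met today .",
    "village council the met",
    "completely unrelated output",
]
selected = mbr_select(candidates)
```

In this toy pool, the outlier and the scrambled hypothesis receive low average utility, so the selection falls on one of the two near-identical candidates. A production system would more plausibly use an established metric implementation rather than these simplified scorers.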

Anthology ID: 2025.mmloso-1.13
Volume: Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)
Month: December
Year: 2025
Address: Mumbai, India
Editors: Ankita Shukla, Sandeep Kumar, Amrit Singh Bedi, Tanmoy Chakraborty
Venues: MMLoSo | WS
Publisher: Association for Computational Linguistics
Pages: 114–120
URL: https://aclanthology.org/2025.mmloso-1.13/
Cite (ACL): Dao Sy Duy Minh, Trung Kiet Huynh, Tran Chi Nguyen, Phu Quy Nguyen Lam, Phu-Hoa Pham, Nguyễn Đình Hà Dương, Dien Dinh, and Long HB Nguyen. 2025. Challenge Track: JHARNA-MT: A Copy-Augmented Hybrid of LoRA-Tuned NLLB and Lexical SMT with Minimum Bayes Risk Decoding for Low-Resource Indic Languages. In Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025), pages 114–120, Mumbai, India. Association for Computational Linguistics.
Cite (Informal): Challenge Track: JHARNA-MT: A Copy-Augmented Hybrid of LoRA-Tuned NLLB and Lexical SMT with Minimum Bayes Risk Decoding for Low-Resource Indic Languages (Minh et al., MMLoSo 2025)
PDF: https://aclanthology.org/2025.mmloso-1.13.pdf