Breaking Consensus Bias: Unsupervised Reinforcement Learning for Machine Translation

Shuting Jiang; Ran Song (宋燃); Siqi Zhang; Yuxin Huang (黄于欣, 黄宇欣); Shengxiang Gao; Zhengtao Yu (余正涛)

Abstract

Reinforcement learning (RL) excels at reasoning tasks with verifiable rewards, but adapting it to machine translation (MT) remains challenging because a source sentence admits multiple valid translations and therefore offers no unique reward signal. Existing RL approaches for MT either rely on fixed references in supervised settings or produce homogeneous references in unsupervised settings, which leads to mode collapse. Both limitations arise from ignoring entropy dynamics in RL-based MT. The core challenge is to leverage entropy for both supervision construction and self-evolution. In this paper, we propose an entropy-driven unsupervised RL framework for MT. Our framework integrates entropy-guided sampling for exploration, confidence-weighted label generation to transcend majority-voting bias, and uncertainty-aware optimization to prioritize high-entropy tokens. These mechanisms allow reward signals to co-evolve with model proficiency beyond fixed references. Experiments across multiple language pairs show that our method outperforms supervised and unsupervised baselines by +0.63 and +2.52 average points, respectively. Our code is available at https://github.com/fortunatekiss/URLMT.
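
The abstract names token entropy and confidence-weighted label generation without giving formulas, so the following is only a minimal sketch of the general ideas, not the authors' implementation (which lives in the repository linked above). It assumes that each sampled translation comes with a mean token log-probability and that this value serves as its confidence. The function names, the toy samples, and the choice of weighting are all hypothetical.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def majority_vote(candidates):
    """Pick the translation that was sampled most often."""
    counts = defaultdict(int)
    for text, _ in candidates:
        counts[text] += 1
    return max(counts, key=counts.get)


def confidence_weighted_vote(candidates):
    """Pick the translation with the largest summed confidence.

    Assumption: confidence is exp(mean token log-probability). The paper
    may define confidence differently.
    """
    weights = defaultdict(float)
    for text, mean_logprob in candidates:
        weights[text] += math.exp(mean_logprob)
    return max(weights, key=weights.get)


# Hypothetical samples: (translation, mean token log-probability).
samples = [
    ("the cat sat", math.log(0.2)),
    ("the cat sat", math.log(0.2)),
    ("the cat was sitting", math.log(0.9)),
]

majority_label = majority_vote(samples)
weighted_label = confidence_weighted_vote(samples)
uniform_entropy = token_entropy([0.5, 0.5])
```

In this toy input the frequent but low-confidence candidate wins the plain majority vote, while the single high-confidence candidate wins the weighted vote. That is one way a pseudo-label can depart from the consensus of the samples.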

Anthology ID: 2026.findings-acl.1042
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 20798–20812
URL: https://aclanthology.org/2026.findings-acl.1042/
Cite (ACL): Shuting Jiang, Ran Song, Siqi Zhang, Yuxin Huang, Shengxiang Gao, and Zhengtao Yu. 2026. Breaking Consensus Bias: Unsupervised Reinforcement Learning for Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20798–20812, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Breaking Consensus Bias: Unsupervised Reinforcement Learning for Machine Translation (Jiang et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1042.pdf
Checklist: 2026.findings-acl.1042.checklist.pdf
