SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Wenjie Yang; Mao Zheng; Mingyang Song; Zheng Li; Sitong Wang

Abstract

Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs rely heavily on external supervision during training, such as human-annotated reference data or trained reward models (RMs), which are expensive to obtain and difficult to scale. To address this limitation, we propose **Simple Self-Rewarding (SSR)**, a reinforcement learning (RL) framework for MT that is reference-free and relies solely on self-judging rewards. Using only 13K monolingual examples and Qwen2.5-7B as the backbone, SSR-Zero-7B outperforms existing MT-specific LLMs as well as larger general-purpose LLMs such as Qwen2.5-32B-Instruct on English ↔ Chinese translation benchmarks, including WMT23, WMT24, and FLORES-200. It further generalizes well to low-resource language pairs. In addition, when augmented with external supervision from COMET, our strongest model, SSR-X-Zero-7B, surpasses all existing open-source models under 72B parameters and performs competitively with leading closed-source systems in English ↔ Chinese translation. Our analysis highlights the effectiveness and generalizability of the self-rewarding mechanism relative to external LLM-as-a-judge approaches and demonstrates its complementary benefits when combined with trained RMs. We will publicly release our code, data, and models.
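The abstract does not give the judging prompt, the score scale, or how the COMET signal is combined with the self-judged score. The sketch below is therefore only a minimal illustration of the general self-rewarding idea, under stated assumptions: `generate` stands for the policy model's own text-generation call, so the same model that produced the translation also judges it; `JUDGE_TEMPLATE`, `parse_score`, `self_reward`, the 0 to 100 scale, and the linear mixing `weight` are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt and 0-100 scale (assumptions, not the paper's prompt).
JUDGE_TEMPLATE = (
    "Rate the translation of the source text on a scale of 0 to 100.\n"
    "Source: {src}\nTranslation: {hyp}\nScore:"
)


def parse_score(reply: str, max_score: float = 100.0) -> float:
    """Take the first non-negative number in the judge's reply and map it to [0, 1].

    Returns 0.0 when no number is found, i.e. an unparseable judgment earns no reward.
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0
    return min(float(match.group()), max_score) / max_score


def self_reward(
    src: str,
    hyp: str,
    generate: Callable[[str], str],
    comet: Optional[Callable[[str, str], float]] = None,
    weight: float = 0.5,
) -> float:
    """Reference-free reward: the policy model (`generate`) scores its own translation.

    If an external scorer `comet` is supplied (hypothetical stand-in returning a
    value in [0, 1]), it is linearly mixed in with `weight`; this mixing rule is
    an illustrative assumption only.
    """
    judged = parse_score(generate(JUDGE_TEMPLATE.format(src=src, hyp=hyp)))
    if comet is None:
        return judged
    return (1.0 - weight) * judged + weight * comet(src, hyp)


if __name__ == "__main__":
    # Stub "model" that always answers 85; a real setup would call the policy LLM.
    fake_generate = lambda prompt: "85"
    print(self_reward("你好", "Hello", fake_generate))
    print(self_reward("你好", "Hello", fake_generate, comet=lambda s, h: 0.9))
```

Because no reference translation enters `self_reward`, the reward can be computed from monolingual source text alone, which is the property the abstract emphasizes; the optional `comet` argument only gestures at how an external signal might be added.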

Anthology ID: 2026.findings-acl.300
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6039–6052
URL: https://aclanthology.org/2026.findings-acl.300/
Cite (ACL): Wenjie Yang, Mao Zheng, Mingyang Song, Zheng Li, and Sitong Wang. 2026. SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6039–6052, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation (Yang et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.300.pdf
Checklist: 2026.findings-acl.300.checklist.pdf