Reinforcement Learning for Diffusion LLMs via Energy-Based Gibbs Alignment

Yijia Fan; Jing Yang; Mingyu Liu; Kaitong Cai; Jian Wang; Keze Wang; Jusheng Zhang

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm for text generation, offering parallel decoding and bidirectional context modeling. However, aligning dLLMs with reinforcement learning (RL) remains a significant challenge, as the marginal likelihood of sequences in masked diffusion is typically intractable, rendering standard policy gradient methods unstable or computationally prohibitive. In this work, we propose **Diffusion-Gibbs Alignment (DGA)**, a novel variational framework that reformulates RL for dLLMs as a distribution matching problem. DGA bypasses the explicit computation of log-probabilities by leveraging a learned energy function to model the relative quality of samples. The optimization is decoupled into two stable steps: (1) contrastive energy ranking to capture global reward structures, and (2) weighted diffusion alignment to update the policy via importance sampling. Empirically, DGA establishes a new state-of-the-art across logical reasoning (Sudoku, Countdown), mathematical reasoning (GSM8K, Math500), and code generation (HumanEval, MBPP) benchmarks. DGA offers a novel variational perspective for dLLM alignment, achieving better performance while simultaneously enhancing training speed and memory efficiency.

Anthology ID: 2026.acl-long.2131
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 45938–45948
URL: https://aclanthology.org/2026.acl-long.2131/
Cite (ACL): Yijia Fan, Jing Yang, Mingyu Liu, Kaitong Cai, Jian Wang, Keze Wang, and Jusheng Zhang. 2026. Reinforcement Learning for Diffusion LLMs via Energy-Based Gibbs Alignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45938–45948, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Reinforcement Learning for Diffusion LLMs via Energy-Based Gibbs Alignment (Fan et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.2131.pdf
Checklist: 2026.acl-long.2131.checklist.pdf
