Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Nianyi Lin; Jiajie Zhang; Lei Hou; Juanzi Li

Abstract

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is that their likelihood functions, which the RL objective requires, are intractable and must therefore be approximated during training. Existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, but they incur significant memory overhead because all MC samples must be retained to compute the gradients of the non-linear terms in the RL objective. This restricts feasible sample sizes, leading to imprecise likelihood approximations and a distorted RL objective. To address this, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is designed to satisfy two key properties: (1) Linearity: it is a linear sum in which each term depends on only a single MC sample, which enables gradient accumulation across samples and ensures constant memory usage; (2) Equivalence: both the value and the gradient of this lower bound equal those of the ELBO-based objective in on-policy training, making it an effective approximation of the original RL objective as well. These properties allow BGPO to adopt a large MC sample size, improving the likelihood approximations and the RL objective estimate, which in turn leads to better performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs on math problem solving, code generation, and planning tasks.
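The two properties named in the abstract can be illustrated with a toy sketch. This is not the bound constructed in the paper: the per-sample term `f`, the choice of `exp` as the non-linearity, and the tangent-line bound are all assumptions picked only because they make linearity and on-policy equivalence easy to check by hand. The tangent-line inequality exp(z) >= exp(a) * (1 + z - a) gives a lower bound that is a plain sum over samples, so its gradient can be accumulated one sample at a time, and it matches the non-linear objective in value and gradient when the anchor `a` is evaluated at the current parameters.

```python
import math


def f(theta, x):
    # Hypothetical per-sample term standing in for one MC sample's contribution.
    return -(theta - x) ** 2


def grad_f(theta, x):
    # Analytic derivative of f with respect to theta.
    return -2.0 * (theta - x)


def mean_f(theta, samples):
    return sum(f(theta, x) for x in samples) / len(samples)


def nonlinear_objective(theta, samples):
    # exp of the sample mean: non-linear in the per-sample terms.
    return math.exp(mean_f(theta, samples))


def nonlinear_grad(theta, samples):
    # Chain rule: the exp factor depends on all samples jointly.
    mean_g = sum(grad_f(theta, x) for x in samples) / len(samples)
    return math.exp(mean_f(theta, samples)) * mean_g


def linear_bound(theta, samples, anchor):
    # Tangent-line lower bound of exp at `anchor`, written as a sum in which
    # each term depends on a single sample.
    n = len(samples)
    scale = math.exp(anchor)
    return sum(scale * (1.0 + f(theta, x) - anchor) / n for x in samples)


def linear_bound_grad_accumulated(theta, samples, anchor):
    # Gradient accumulation: only one sample is processed at a time, and
    # `anchor` is treated as a constant that carries no gradient.
    n = len(samples)
    scale = math.exp(anchor)
    grad = 0.0
    for x in samples:
        grad += scale * grad_f(theta, x) / n
    return grad


samples = [0.1, 0.4, -0.3, 0.8]
theta0 = 0.5
# "On-policy" in this toy setting: the anchor is computed at the current theta.
anchor = mean_f(theta0, samples)

bound_value = linear_bound(theta0, samples, anchor)
bound_grad = linear_bound_grad_accumulated(theta0, samples, anchor)
```

At `theta0` the bound and the non-linear objective agree in both value and gradient, while at any other `theta` (with the anchor held fixed) the bound lies below the objective. In an autograd framework, the practical benefit of the linear form would be that each sample's backward pass can run and be discarded before the next sample is processed.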

Anthology ID: 2026.acl-long.343
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7539–7550
URL: https://aclanthology.org/2026.acl-long.343/
Cite (ACL): Nianyi Lin, Jiajie Zhang, Lei Hou, and Juanzi Li. 2026. Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7539–7550, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models (Lin et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.343.pdf
Checklist: 2026.acl-long.343.checklist.pdf
