I²B-LPO: Latent Policy Optimization via Iterative Information Bottleneck

Huilin Deng; Hongchen Luo; Yue Zhu; Long Li; Zhuoyue Chen; Xinghao Zhao; Ming LI; Chuyang Zhao; Jihai Zhang; MengChang Wang; Yang Cao; Yu Kang

Abstract

Despite recent advances in reinforcement learning with verifiable rewards (RLVR) for large language model (LLM) reasoning, most methods suffer from exploration collapse, as the semantic homogeneity of random rollouts traps models in narrow, over-optimized behaviors. Existing methods leverage policy entropy to encourage exploration, but face inherent limitations: global entropy regularization is susceptible to reward hacking, inducing meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To this end, we propose Latent Policy Optimization via Iterative Information Bottleneck (I²B-LPO), which shifts from statistical perturbation of token distributions to topological branching of reasoning trajectories. I²B-LPO triggers latent branching at high-entropy states to diversify reasoning trajectories and applies the Information Bottleneck as a trajectory filter and self-reward to ensure concise and informative exploration. Empirical results on four mathematical benchmarks demonstrate that I²B-LPO achieves state-of-the-art performance, with margins of up to 5.3% in accuracy and 7.4% in diversity metrics. Code is available at https://github.com/denghuilin-cyber/IIB-LPO.

Anthology ID: 2026.acl-long.1084
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 23647–23664
URL: https://aclanthology.org/2026.acl-long.1084/
Cite (ACL): Huilin Deng, Hongchen Luo, Yue Zhu, Long Li, Zhuoyue Chen, Xinghao Zhao, Ming LI, Chuyang Zhao, Jihai Zhang, MengChang Wang, Yang Cao, and Yu Kang. 2026. I²B-LPO: Latent Policy Optimization via Iterative Information Bottleneck. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23647–23664, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): I²B-LPO: Latent Policy Optimization via Iterative Information Bottleneck (Deng et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1084.pdf
Checklist: 2026.acl-long.1084.checklist.pdf
