Toward Optimal LLM Alignments Using Two-Player Games

Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Yang Liu, Hang Li


Abstract
Alignment of large language models (LLM) is a process that ensures the model’s responses to user prompts align with human intentions and social values. This optimization typically relies on pre-collected prompts. The collection of these prompts often either requires careful human interventions or proves to be difficult to have a good coverage over all scenarios an LLM can improve over . To address this issue, we propose an alignment method based on a two-agent game, consisting of an adversarial agent and a defensive agent. The adversarial agent’s task is to generate prompts that expose the deficiencies of the defensive agent. At the same time, the defensive agent improves its performance on the prompts generated by the adversary based on feedback from the reward model. This iterative process is repeated to enhance the model’s performance. We theoretically demonstrate that, under mild assumptions, this iterative alignment process converges to a Nash equilibrium by both agents. Learning in this competitive environment results in policies with better generalization capabilities. We demonstrate the advantage of our framework using extensive experiments.
Anthology ID:
2025.findings-emnlp.6
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
78–99
Language:
URL:
https://aclanthology.org/2025.findings-emnlp.6/
DOI:
Bibkey:
Cite (ACL):
Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Yang Liu, and Hang Li. 2025. Toward Optimal LLM Alignments Using Two-Player Games. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 78–99, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Toward Optimal LLM Alignments Using Two-Player Games (Zheng et al., Findings 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.findings-emnlp.6.pdf
Checklist:
 2025.findings-emnlp.6.checklist.pdf