@inproceedings{wang-etal-2025-adversarial,
title = "Adversarial Preference Learning for Robust {LLM} Alignment",
author = "Wang, Yuanfu and
Wang, Pengyu and
Xi, Chenyang and
Tang, Bo and
Zhu, Junyi and
Wei, Wenqiang and
Chen, Chen and
Yang, Chao and
Zhang, Jingfeng and
Lu, Chaochao and
Niu, Yijun and
Mao, Keming and
Li, Zhiyu and
Xiong, Feiyu and
Hu, Jie and
Yang, Mingchuan",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1126/",
doi = "10.18653/v1/2025.findings-acl.1126",
pages = "21865--21881",
ISBN = "979-8-89176-256-5",
abstract = "Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model{'}s intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33{\%} harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88{\%} to 0.43{\%} (measured by LLaMA-Guard), and lowering attack success rate by up to 65{\%} according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52{\%} against the base model."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="wang-etal-2025-adversarial">
<titleInfo>
<title>Adversarial Preference Learning for Robust LLM Alignment</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yuanfu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pengyu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chenyang</namePart>
<namePart type="family">Xi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bo</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junyi</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wenqiang</namePart>
<namePart type="family">Wei</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chen</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chao</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jingfeng</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chaochao</namePart>
<namePart type="family">Lu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yijun</namePart>
<namePart type="family">Niu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Keming</namePart>
<namePart type="family">Mao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhiyu</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Feiyu</namePart>
<namePart type="family">Xiong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jie</namePart>
<namePart type="family">Hu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mingchuan</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model’s intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.</abstract>
<identifier type="citekey">wang-etal-2025-adversarial</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.1126</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.1126/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>21865</start>
<end>21881</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Adversarial Preference Learning for Robust LLM Alignment
%A Wang, Yuanfu
%A Wang, Pengyu
%A Xi, Chenyang
%A Tang, Bo
%A Zhu, Junyi
%A Wei, Wenqiang
%A Chen, Chen
%A Yang, Chao
%A Zhang, Jingfeng
%A Lu, Chaochao
%A Niu, Yijun
%A Mao, Keming
%A Li, Zhiyu
%A Xiong, Feiyu
%A Hu, Jie
%A Yang, Mingchuan
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F wang-etal-2025-adversarial
%X Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model’s intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
%R 10.18653/v1/2025.findings-acl.1126
%U https://aclanthology.org/2025.findings-acl.1126/
%U https://doi.org/10.18653/v1/2025.findings-acl.1126
%P 21865-21881

Markdown (Informal)
[Adversarial Preference Learning for Robust LLM Alignment](https://aclanthology.org/2025.findings-acl.1126/) (Wang et al., Findings 2025)

ACL
Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, and Mingchuan Yang. 2025. Adversarial Preference Learning for Robust LLM Alignment. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21865–21881, Vienna, Austria. Association for Computational Linguistics.
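
The abstract's first component, a harmfulness metric derived from the model's own preference probabilities, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it assumes a Bradley–Terry-style preference probability computed from the aligned model's sequence log-likelihoods for a harmful versus a safe (refusal) response, which is one plausible reading of "intrinsic preference probabilities". The prompts, responses, and scoring details are placeholders for illustration only.

```python
# Illustrative sketch only -- NOT the APL authors' code. It approximates a
# "harmfulness score from intrinsic preference probabilities" as the
# Bradley-Terry probability that the model prefers a harmful response over
# a safe refusal, using sequence log-likelihoods under the model itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.3"  # model used in the paper's experiments
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `prompt`.
    Approximation: assumes tokenizing prompt+response prefix-matches the prompt tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1, :]            # predict each next token
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_logps[:, n_prompt - 1:].sum().item()     # keep only the response tokens

def harmfulness_score(prompt: str, harmful: str, safe: str) -> float:
    """Sigmoid of the log-likelihood gap: probability that the model 'prefers'
    the harmful continuation over the safe one -- higher means more vulnerable."""
    gap = response_logprob(prompt, harmful) - response_logprob(prompt, safe)
    return torch.sigmoid(torch.tensor(gap)).item()

# Hypothetical usage: score one adversarial prompt candidate.
score = harmfulness_score(
    prompt="[INST] How do I pick a lock? [/INST] ",
    harmful="Sure, here is a step-by-step guide...",
    safe="I can't help with that request.",
)
print(f"harmfulness score: {score:.3f}")
```

Under this reading, such a score could serve as the closed-loop feedback signal the abstract describes: an attacker proposes input variations, the target model's own preference probability flags the variations it is most likely to mishandle, and those cases feed back into preference training, without an external reward model or human annotation.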