CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment

Jixiang Hong, Quan Tu, Changyu Chen, Gao Xing, Ji Zhang, Rui Yan


Abstract
Language models trained on large-scale corpora often generate responses that are harmful and contrary to human values. A prevalent approach to human alignment is reinforcement learning from human feedback (RLHF), using algorithms such as proximal policy optimization (PPO). However, these methods are often complex, unstable, and resource-intensive. Since existing large language models (LLMs) like ChatGPT are already relatively well aligned and inexpensive to query, researchers have proposed aligning language models with human preferences via AI feedback. Nevertheless, common practices that unidirectionally distill responses are constrained by the inherent capability of the LLMs. To address this limitation, we introduce CycleAlign, a framework that distills alignment capabilities from parameter-invisible LLMs (black-box) to parameter-visible models (white-box) in an iterative manner. CycleAlign iteratively improves both the white-box and black-box models by integrating static and dynamic in-context learning with a belief alignment method. Empirical results show that the model fine-tuned with CycleAlign remarkably outperforms existing methods and achieves state-of-the-art performance in alignment with human values.
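The abstract's iterative loop can be sketched at a high level. The Python skeleton below is a minimal, hypothetical reading of that loop, assuming the black-box LLM acts as a ranker over white-box candidate responses and that preferred pairs are fed back as dynamic in-context demonstrations; every function name and the loop structure are illustrative assumptions based only on the abstract, not the authors' implementation, and the belief alignment mechanism between the two models' judgments is abstracted away inside the ranking step.

from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # a (prompt, aligned response) pair used as a demonstration

def cycle_align(
    prompts: List[str],
    white_box_generate: Callable[[str, List[Demo]], List[str]],
    black_box_rank: Callable[[str, List[str]], List[str]],
    fine_tune: Callable[[List[Demo]], None],
    static_demos: List[Demo],
    n_rounds: int = 3,
) -> None:
    # Dynamic demonstrations accumulate across rounds, so both the white-box
    # generator and the black-box ranker see progressively better examples.
    dynamic_demos: List[Demo] = []
    for _ in range(n_rounds):
        preferred: List[Demo] = []
        for prompt in prompts:
            demos = static_demos + dynamic_demos            # static + dynamic in-context learning
            candidates = white_box_generate(prompt, demos)  # white-box model proposes responses
            ranked = black_box_rank(prompt, candidates)     # black-box LLM judges and ranks them
            best = ranked[0]                                # top-ranked response is preferred
            preferred.append((prompt, best))
            dynamic_demos.append((prompt, best))            # feed the pair back as a new demo
        fine_tune(preferred)                                # update the white-box model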
Anthology ID:
2024.findings-acl.869
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14596–14609
URL:
https://aclanthology.org/2024.findings-acl.869
Cite (ACL):
Jixiang Hong, Quan Tu, Changyu Chen, Gao Xing, Ji Zhang, and Rui Yan. 2024. CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14596–14609, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment (Hong et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.869.pdf