Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness

Xiaoyu Tan, Shaojie Shi, Xihe Qiu, Chao Qu, Zhenting Qi, Yinghui Xu, Yuan Qi


Abstract
Conversational large language models (LLMs) such as ChatGPT and Claude have recently gained notable significance, as they contribute substantially to progress toward artificial general intelligence (AGI). Typically, these models undergo a two-phase fine-tuning process: instruction fine-tuning (IF) and reinforcement learning from human feedback (RLHF). These methods aim to align LLMs to be helpful, honest, and harmless (HHH). However, RLHF, which relies on independent reward models trained on high-quality human feedback datasets, incurs high costs in hardware resources and human effort. We therefore explore aligning LLMs with their own understanding of HHH through IF and in-context learning (ICL). In this study, we propose a novel framework called Self-Criticism, which allows LLMs to align themselves with HHH based on the definitions they have learned from a large-scale text corpus. We first apply IF on a given instruction set and teach HHH discrimination through few-shot ICL. The LLMs then evaluate their own generated responses and learn to produce “better” responses based on this self-judgment. Finally, the model is retrained on the self-generated responses to distill the whole process. Analyzing the proposed method, we also find interesting connections between Self-Criticism, goal-conditioned reinforcement learning, and pseudo-labeling. Experimental results demonstrate that this method achieves nearly identical performance to RLHF in both human evaluation and evaluation by other LLMs, with only a minimal alignment tax.
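
The abstract describes a three-step loop: fine-tune on instructions, have the model score its own outputs against the HHH criteria via few-shot ICL, and retrain on the self-selected responses. Below is a minimal Python sketch of that loop, assuming hypothetical `generate`, `hhh_score`, and `finetune` callables; the sampling count and best-of-n selection rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Minimal sketch of the Self-Criticism loop described in the abstract.
# All three callables are hypothetical stand-ins: any instruction-tuned
# LLM API that supports sampling, few-shot scoring, and supervised
# fine-tuning could fill these roles.

def self_criticism_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],          # (prompt, n) -> n sampled responses
    hhh_score: Callable[[str, str], float],             # (prompt, response) -> HHH score via few-shot ICL
    finetune: Callable[[List[Tuple[str, str]]], None],  # supervised update on (prompt, response) pairs
    n_samples: int = 4,
) -> List[Tuple[str, str]]:
    """One round: sample candidates, self-judge them, keep the best, retrain."""
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        # Step 1: the instruction-fine-tuned model samples several candidates.
        candidates = generate(prompt, n_samples)
        # Step 2: the same model scores each candidate against its learned
        # definition of helpful, honest, and harmless (few-shot ICL judging).
        scored = [(hhh_score(prompt, c), c) for c in candidates]
        # Step 3: keep the self-preferred response as a pseudo-label.
        best = max(scored, key=lambda sc: sc[0])[1]
        selected.append((prompt, best))
    # Step 4: distill the whole process by retraining on the kept responses.
    finetune(selected)
    return selected
```

Keeping only the self-preferred response and retraining on it is what connects the procedure to pseudo-labeling, while steering generation toward higher self-judged quality is what links it to goal-conditioned reinforcement learning, as the abstract notes.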
Anthology ID:
2023.emnlp-industry.62
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2023
Address:
Singapore
Editors:
Mingxuan Wang, Imed Zitouni
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
650–662
URL:
https://aclanthology.org/2023.emnlp-industry.62
DOI:
10.18653/v1/2023.emnlp-industry.62
Cite (ACL):
Xiaoyu Tan, Shaojie Shi, Xihe Qiu, Chao Qu, Zhenting Qi, Yinghui Xu, and Yuan Qi. 2023. Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 650–662, Singapore. Association for Computational Linguistics.
Cite (Informal):
Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness (Tan et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-industry.62.pdf
Video:
https://aclanthology.org/2023.emnlp-industry.62.mp4