Aligning Generative Language Models with Human Values

Ruibo Liu, Ge Zhang, Xinyu Feng, Soroush Vosoughi


Abstract
Although current large-scale generative language models (LMs) can show impressive insights about factual knowledge, they do not exhibit similar success with respect to human value judgements (e.g., whether or not the generations of an LM are moral). Existing methods learn human values either by directly mimicking the behavior of human data, or by rigidly constraining the generation space to human-chosen tokens. These methods are inherently limited in that they do not consider the contextual and abstract nature of human values; as a result, they often fail when dealing with out-of-domain contexts or sophisticated and abstract human values. This paper proposes SENSEI, a new reinforcement-learning-based method that can embed human value judgements into each step of language generation. SENSEI deploys an Actor-Critic framework, where the Critic is a reward distributor that simulates the reward-assignment procedure of humans, while the Actor guides generation in the direction of maximum reward. Compared with five existing methods on three human value alignment datasets, SENSEI not only achieves higher alignment performance in terms of both automatic and human evaluations, but also shows improvements in robustness and transfer learning on unseen human values.
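The Actor-Critic idea in the abstract — a Critic that scores candidate tokens for value alignment, and an Actor that steers decoding toward high-reward tokens — can be sketched conceptually as reward-shifted sampling. Everything below (the toy vocabulary, the reward table, the `beta` weight) is an illustrative assumption for exposition, not the paper's actual implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def critic_reward(token):
    # Hypothetical per-token value-alignment reward (stand-in for a
    # learned Critic): prefer "kind", penalize "cruel".
    return {"kind": 1.0, "neutral": 0.0, "cruel": -1.0}[token]

def actor_step(lm_logits, tokens, beta=2.0):
    # The Actor shifts each LM logit by beta * reward and renormalizes,
    # steering the next-token distribution toward high-reward tokens.
    shifted = [l + beta * critic_reward(t) for l, t in zip(lm_logits, tokens)]
    return softmax(shifted)

tokens = ["kind", "neutral", "cruel"]
base = softmax([0.0, 0.0, 0.0])            # uniform base LM distribution
steered = actor_step([0.0, 0.0, 0.0], tokens)
```

In this toy setup the steered distribution puts most of its mass on "kind"; the paper instead learns the Critic's reward assignment from human data and applies it at every generation step.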
Anthology ID:
2022.findings-naacl.18
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
241–252
URL:
https://aclanthology.org/2022.findings-naacl.18
DOI:
10.18653/v1/2022.findings-naacl.18
Cite (ACL):
Ruibo Liu, Ge Zhang, Xinyu Feng, and Soroush Vosoughi. 2022. Aligning Generative Language Models with Human Values. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 241–252, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Aligning Generative Language Models with Human Values (Liu et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.18.pdf
Video:
https://aclanthology.org/2022.findings-naacl.18.mp4