Aligning as Debiasing: Causality-Aware Alignment via Reinforcement Learning with Interventional Feedback

Yu Xia, Tong Yu, Zhankui He, Handong Zhao, Julian McAuley, Shuai Li


Abstract
Large language models (LLMs) often generate biased outputs containing offensive, toxic, or stereotypical text. Existing LLM alignment methods, such as reinforcement learning from human feedback (RLHF), alleviate biases primarily based on reward signals from current model outputs, without considering where the biases originate. In this work, to explore how biases form, we revisit LLM text generation from a causal perspective. We identify pretraining data and input prompts, both of which contain semantic correlations among textual phrases, as two confounders between LLMs and model outputs that cause biases. Inspired by this causal view, we leverage the reward model in RL alignment as an instrumental variable to perform causal intervention on LLMs. Using the reward difference between an initial LLM and the intervened LLM as interventional feedback to guide RL finetuning, we propose Causality-Aware Alignment (CAA) for LLM debiasing. Experiments on two text generation tasks with three different alignment objectives demonstrate the advantages of our method in aligning LLMs to generate less biased and safer outputs.
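
The abstract's key mechanism is the reward difference between the initial LLM and the intervened LLM. Below is a minimal, hypothetical Python sketch of how such an interventional feedback signal could be computed; all function and argument names are placeholders of ours, since the abstract does not specify implementation details.

from typing import Callable

def interventional_feedback(
    prompt: str,
    generate_initial: Callable[[str], str],     # frozen initial LLM (hypothetical interface)
    generate_intervened: Callable[[str], str],  # LLM after causal intervention (hypothetical)
    reward: Callable[[str, str], float],        # reward model used as instrumental variable
) -> float:
    """Reward difference between the intervened and initial LLM outputs.

    A positive value suggests the intervention shifted generation toward
    higher-reward (e.g., less biased) text; this scalar is the kind of
    interventional feedback the abstract describes for guiding RL finetuning.
    """
    r_initial = reward(prompt, generate_initial(prompt))
    r_intervened = reward(prompt, generate_intervened(prompt))
    return r_intervened - r_initial

In an RL loop, this scalar could replace or augment the standard reward in a policy-gradient update (e.g., PPO), so the policy is optimized against the measured effect of the intervention rather than the raw reward alone.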
Anthology ID:
2024.naacl-long.262
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4684–4695
URL:
https://aclanthology.org/2024.naacl-long.262
DOI:
10.18653/v1/2024.naacl-long.262
Cite (ACL):
Yu Xia, Tong Yu, Zhankui He, Handong Zhao, Julian McAuley, and Shuai Li. 2024. Aligning as Debiasing: Causality-Aware Alignment via Reinforcement Learning with Interventional Feedback. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4684–4695, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Aligning as Debiasing: Causality-Aware Alignment via Reinforcement Learning with Interventional Feedback (Xia et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.262.pdf