Hongzhi Wen


2024

Are Large Language Models (LLMs) Good Social Predictors?
Kaiqi Yang | Hang Li | Hongzhi Wen | Tai-Quan Peng | Jiliang Tang | Hui Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

With the recent advancement of Large Language Models (LLMs), efforts have been made to leverage LLMs in crucial social science methods, including predicting features of social life such as presidential voting. Existing work suggests that LLMs are capable of generating human-like responses. Nevertheless, it is unclear how well LLMs actually perform and where their plausible predictions come from. This paper critically examines the performance of LLMs as social predictors, identifying the sources of their correct predictions and their limitations. Based on the notion of mutability, which classifies social features, we design three realistic settings and a novel social prediction task in which the LLMs make predictions from input features of the same mutability and accessibility as the response feature. We find that the promising performance reported in previous studies arises from input features that act as shortcuts to the response and are hard to obtain in reality; performance degrades dramatically to near-random once these shortcuts are removed. Through comprehensive investigations of various LLMs, we reveal that LLMs struggle to perform as expected on social prediction when given ordinarily available input features without shortcuts. We further investigate possible reasons for this phenomenon and suggest potential ways to enhance LLMs for social prediction.
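The core experimental contrast in this abstract, prediction with versus without shortcut features, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the respondent attributes, the prompt wording, and the choice of party identification as the shortcut feature are illustrative assumptions.

```python
# Hypothetical illustration of the "shortcut" contrast described above.
# Feature names, values, and prompt wording are assumptions, not the paper's data.

def build_prompt(features: dict) -> str:
    """Turn a respondent's attributes into a vote-prediction prompt."""
    described = ", ".join(f"{k}: {v}" for k, v in features.items())
    return (
        "Given a survey respondent with the following attributes, predict their "
        f"presidential vote choice.\nAttributes: {described}\nAnswer:"
    )

respondent = {
    "age": 45,
    "education": "college degree",
    "region": "Midwest",
    "party identification": "Republican",  # shortcut: nearly determines the response
}

prompt_with_shortcut = build_prompt(respondent)
prompt_without_shortcut = build_prompt(
    {k: v for k, v in respondent.items() if k != "party identification"}
)
# Sending both prompts to the same LLM contrasts the two conditions the paper
# studies: accuracy is high when the shortcut is present and drops to near-random
# once it is removed.
```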

IterAlign: Iterative Constitutional Alignment of Large Language Models
Xiusi Chen | Hongzhi Wen | Sreyashi Nag | Chen Luo | Qingyu Yin | Ruirui Li | Zheng Li | Wei Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning from human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotation or explicitly pre-defined constitutions, which is labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose IterAlign, a data-driven framework for constitution discovery and self-alignment. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. This constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness, and honesty, improving LLM alignment by up to 13.5% in harmlessness.
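The iterative loop described in this abstract (red teaming, constitution discovery with a stronger LLM, self-correction of the base LLM) can be sketched as follows. This is an assumption-laden outline, not the authors' implementation: the model callables, the judge function, and the way self-correction is represented are placeholders.

```python
# Hypothetical sketch of an iterative constitution-discovery loop in the spirit of
# IterAlign. All callables are placeholders supplied by the caller.
from typing import Callable, List, Tuple

def iteralign_sketch(
    base_llm: Callable[[str], str],        # model being aligned
    strong_llm: Callable[[str], str],      # stronger model that drafts constitutions
    is_problematic: Callable[[str], bool], # any safety judge / classifier
    red_team_prompts: List[str],
    n_iterations: int = 3,
) -> Tuple[Callable[[str], str], List[str]]:
    constitutions: List[str] = []
    for _ in range(n_iterations):
        # 1. Red teaming: collect responses that expose alignment gaps.
        failures = [(p, base_llm(p)) for p in red_team_prompts
                    if is_problematic(base_llm(p))]
        if not failures:
            break
        # 2. Constitution discovery: the stronger LLM proposes principles that
        #    address the observed failures.
        new_rules = strong_llm(
            "Propose constitutional principles that would prevent these responses:\n"
            + "\n".join(f"Prompt: {p}\nResponse: {r}" for p, r in failures)
        )
        constitutions.append(new_rules)
        # 3. Self-correction: the base LLM answers under the accumulated
        #    constitutions. In the paper the corrected data is then used to update
        #    the base model; this sketch only represents that by re-prompting.
        def aligned(prompt: str, rules="\n".join(constitutions), llm=base_llm) -> str:
            return llm(f"Follow these principles:\n{rules}\n\n{prompt}")
        base_llm = aligned
    return base_llm, constitutions
```

Each iteration targets whatever gaps remain after the previous round, which is what lets the constitution set grow to cover failure modes that a single pre-defined constitution would miss.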