Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Sooyung Choi; Jaehyeok Lee; Xiaoyuan Yi; Jing Yao; Xing Xie; JinYeong Bak

doi:10.18653/v1/2025.acl-long.1532

Abstract

The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior than non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the “black box” of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.

Anthology ID: 2025.acl-long.1532
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 31742–31768
URL: https://aclanthology.org/2025.acl-long.1532/
DOI: 10.18653/v1/2025.acl-long.1532
Cite (ACL): Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, and JinYeong Bak. 2025. Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31742–31768, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights (Choi et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1532.pdf
