Language Models Resist Alignment: Evidence From Data Compression

Jiaming Ji; Kaile Wang; Tianyi Alex Qiu; Boyuan Chen; Jiayi Zhou; Changye Li; Hantao Lou; Josef Dai; Yunhuai Liu; Yaodong Yang (杨耀东)

doi:10.18653/v1/2025.acl-long.1141

Abstract

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.

Anthology ID: 2025.acl-long.1141
Original: 2025.acl-long.1141v1
Version 2: 2025.acl-long.1141v2
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 23411–23432
URL: https://aclanthology.org/2025.acl-long.1141/
DOI: 10.18653/v1/2025.acl-long.1141
Award: Best Paper
Cite (ACL): Jiaming Ji, Kaile Wang, Tianyi Alex Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Josef Dai, Yunhuai Liu, and Yaodong Yang. 2025. Language Models Resist Alignment: Evidence From Data Compression. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23411–23432, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Language Models Resist Alignment: Evidence From Data Compression (Ji et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1141.pdf
