On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment

Zehua Cheng, Manying Zhang, Jiahao Sun, Wei Dai


Abstract
Large language models (LLMs) have made significant advancements, but their increasing capabilities present serious risks of misuse, particularly in open-weight models where direct access to the model's parameters is possible. Current safeguards, designed for closed-weight API models, are inadequate for open-weight models, as minimal fine-tuning can bypass these protections. Preserving the integrity of open-weight LLMs before deployment has thus become a critical challenge. We argue that these vulnerabilities stem from an overemphasis on maximizing the LLM's log-likelihood during training, which amplifies data biases, especially with large datasets. To address these issues, we introduce Kahneman and Tversky's Prospect Theoretic Integrity Preserving Alignment (KT-IPA), a framework that prioritizes maximizing generative utility rather than a single optimization metric. This approach strengthens LLMs against misuse and weaponization while maintaining high performance, even after extensive fine-tuning. Our results demonstrate that integrating prospect theory into LLM training enhances robustness, security, and responsible innovation in this rapidly evolving field. Our code is available at https://anonymous.4open.science/r/KT-IPA-40B7
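To make the prospect-theoretic intuition concrete, the sketch below implements the canonical Kahneman-Tversky value function (using Tversky and Kahneman's 1992 parameter estimates) and applies it to per-example log-likelihood deviations from a reference point. This is an illustration of the general idea only, not the paper's KT-IPA objective; the reference point and the way the value function is wired into scoring are assumptions made for the example.

def kt_value(x: float, alpha: float = 0.88, beta: float = 0.88, lam: float = 2.25) -> float:
    """Canonical Kahneman-Tversky prospect-theory value function.

    Gains (x >= 0) are discounted concavely; losses are weighted
    convexly and scaled by the loss-aversion coefficient lam > 1,
    so a loss hurts more than an equal-sized gain helps. Parameter
    defaults are Tversky and Kahneman's (1992) empirical estimates.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)

# Hypothetical illustration (not the paper's training loop): score each
# example by its log-likelihood deviation from an assumed reference point,
# passing the deviation through kt_value instead of using it directly.
reference = -2.0                 # assumed reference log-likelihood
for logp in [-1.0, -2.0, -3.0]:  # per-example log-likelihoods
    print(logp, kt_value(logp - reference))

Because lam > 1 and the loss branch is steeper, an example whose log-likelihood falls below the reference point is penalized more than a symmetric gain is rewarded; this loss-aversion asymmetry is presumably what the abstract's "generative utility" framing draws on, in contrast to raw log-likelihood maximization, which treats all deviations symmetrically.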
Anthology ID: 2025.coling-main.687
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 10309–10324
URL: https://aclanthology.org/2025.coling-main.687/
Cite (ACL):
Zehua Cheng, Manying Zhang, Jiahao Sun, and Wei Dai. 2025. On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10309–10324, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment (Cheng et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.687.pdf