Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, Muhao Chen


Abstract
While large language models (LLMs) have demonstrated increasing power, their growing capabilities have also prompted studies of their vulnerabilities. Representative among these, jailbreak attacks can provoke harmful or unethical responses from LLMs even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of 1) multilingual cognitive overload, 2) veiled expression, and 3) effect-to-cause reasoning. Unlike previous jailbreak attacks, our proposed cognitive overload is a black-box attack that requires no knowledge of model architecture and no access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive-psychology work on managing cognitive load, we further investigate defenses against the cognitive overload attack from two perspectives. Empirical studies show that cognitive overload from all three perspectives can successfully jailbreak all studied LLMs, while existing defense strategies can hardly mitigate the resulting malicious uses effectively.
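To make the black-box setting described above concrete, the following is a minimal, hypothetical Python sketch of a multilingual cognitive-overload probe. The prompt template, the QueryFn stub, and every name here are illustrative assumptions for exposition; they are not the paper's actual prompts, pipeline, or code. All that the setting requires is a text-in/text-out interface to the model.

    # Hypothetical sketch of a black-box "multilingual cognitive overload" probe.
    # Only a text-in/text-out API is needed: no model weights, no architecture
    # details. The wrapper template below is an illustrative assumption, not the
    # paper's actual prompt.

    from typing import Callable

    # Placeholder type for any chat-completion endpoint (hypothetical signature).
    QueryFn = Callable[[str], str]

    def multilingual_overload_prompt(request: str, translated_request: str) -> str:
        """Wrap a pre-translated request so the model must juggle two languages
        at once -- the kind of extra processing load the attack relies on."""
        return (
            "First restate the following instruction in English, "
            "then carry it out step by step:\n"
            f"{translated_request}\n"
            f"(Original wording for reference: {request})"
        )

    def probe(query: QueryFn, request: str, translated_request: str) -> str:
        """Single black-box probe: one prompt in, one response out."""
        return query(multilingual_overload_prompt(request, translated_request))

    if __name__ == "__main__":
        # Stub model for demonstration; a real study would swap in an API call.
        echo_model: QueryFn = lambda prompt: f"[model response to {len(prompt)} chars]"
        print(probe(echo_model, "benign example request", "exemple de requête bénigne"))

Under these assumptions, the sketch highlights why such attacks are hard to screen out by inspecting model internals: the adversary only ever sees prompts and completions.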
Anthology ID:
2024.findings-naacl.224
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3526–3548
URL:
https://aclanthology.org/2024.findings-naacl.224
DOI:
10.18653/v1/2024.findings-naacl.224
Cite (ACL):
Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, and Muhao Chen. 2024. Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3526–3548, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking (Xu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.224.pdf