Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs

Suhuang Wu, Huimin Wang, Yutian Zhao, Xian Wu, Yefeng Zheng, Wei Li, Hui Li, Rongrong Ji


Abstract
Jailbreak attacks craft specific prompts or append adversarial suffixes to prompts, inducing language models to bypass their safety guardrails and generate harmful or unethical content. With the recent rise of large language models (LLMs), there is a growing focus on jailbreak attacks as a way to probe their safety. While current white-box attacks typically focus on meticulously identifying adversarial suffixes for specific models, their effectiveness and efficiency diminish when applied to different LLMs. In this paper, we propose a Monte Carlo Tree Search (MCTS) based Prompt Auto-generation (MPA) method to enhance the effectiveness and efficiency of attacks across various models. MPA automatically searches for and generates adversarial suffixes for valid jailbreak attacks. Specifically, we first identify a series of action candidates that could potentially trick LLMs into providing harmful responses. To streamline the exploration of adversarial suffixes, we design a prior confidence probability for each MCTS node. We then iteratively auto-generate adversarial prompts using the MCTS framework. Extensive experiments on multiple open-source models (such as Llama, Gemma, and Mistral) and closed-source models (such as ChatGPT) show that MPA surpasses existing methods in both search efficiency and attack effectiveness. The code is available at https://github.com/KDEGroup/MPA.
Anthology ID:
2025.coling-main.71
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1057–1068
URL:
https://aclanthology.org/2025.coling-main.71/
Cite (ACL):
Suhuang Wu, Huimin Wang, Yutian Zhao, Xian Wu, Yefeng Zheng, Wei Li, Hui Li, and Rongrong Ji. 2025. Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1057–1068, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs (Wu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.71.pdf