Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs

Xiangwen Wang, Jie Peng, Kaidi Xu, Huaxiu Yao, Tianlong Chen


Abstract
Recently, there has been a growing focus on attacking large language models (LLMs) to assess their safety. Yet, existing attack methods face challenges: they either require access to model weights or merely cause LLMs to output harmful information without controlling the specific content of that output. Exact control over the LLM's output enables more inconspicuous attacks and could open a new frontier in LLM security. To achieve this, we propose RLTA, the Reinforcement Learning Targeted Attack, a framework designed for attacking large language models (LLMs) that is adaptable to both white-box (weight-accessible) and black-box (weight-inaccessible) scenarios. It automatically generates malicious prompts that trigger target LLMs to produce specific outputs. We demonstrate RLTA in two different scenarios: LLM trojan detection and jailbreaking. Comprehensive experimental results show the potential of RLTA for enhancing the security measures surrounding contemporary LLMs.
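The abstract describes a reinforcement-learning loop in which an attacker is rewarded for making the victim LLM emit a specific target output. The sketch below is a rough illustration of that black-box idea only, not the paper's implementation: the attacker policy, victim model, prompt pool, and reward function are all hypothetical stand-ins, and a real RLTA-style system would update the attacker policy with an RL algorithm rather than simply keeping the best prompt.

```python
# Hedged sketch of a black-box targeted-attack loop.
# All names (attacker_propose, victim_respond, reward) are hypothetical.

import random
import difflib

TARGET_OUTPUT = "example target string the attacker wants the victim to emit"


def attacker_propose(prompt_pool):
    """Hypothetical attacker policy: sample a candidate malicious prompt."""
    return random.choice(prompt_pool)


def victim_respond(prompt):
    """Hypothetical black-box victim LLM call (weights inaccessible)."""
    return "victim model output for: " + prompt  # placeholder response


def reward(output, target=TARGET_OUTPUT):
    """Reward the attacker for controlling the *specific* content of the
    victim's output, measured here as simple string similarity."""
    return difflib.SequenceMatcher(None, output, target).ratio()


def attack_loop(prompt_pool, steps=100):
    """Search for the prompt that best elicits the target output.
    An RL method (e.g., REINFORCE/PPO) would use the reward signal to
    update the attacker policy; this sketch only tracks the best prompt."""
    best_prompt, best_r = None, -1.0
    for _ in range(steps):
        p = attacker_propose(prompt_pool)
        r = reward(victim_respond(p))
        if r > best_r:
            best_prompt, best_r = p, r
    return best_prompt, best_r


if __name__ == "__main__":
    pool = ["prompt A", "prompt B", "prompt C"]  # toy candidate prompts
    print(attack_loop(pool, steps=10))
```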
Anthology ID:
2024.privatenlp-1.17
Volume:
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, Oluwaseyi Feyisetan
Venues:
PrivateNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
170–177
URL:
https://aclanthology.org/2024.privatenlp-1.17
Cite (ACL):
Xiangwen Wang, Jie Peng, Kaidi Xu, Huaxiu Yao, and Tianlong Chen. 2024. Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs. In Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, pages 170–177, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs (Wang et al., PrivateNLP-WS 2024)
PDF:
https://aclanthology.org/2024.privatenlp-1.17.pdf