CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Gao Zuchen, Fei Mi, Lanqing Hong


Abstract
As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research issue. Previous red teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
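
The abstract's headline numbers are attack success rates (ASR) measured over multi-turn coreference dialogues. Purely as an illustration of how such a metric might be tallied, here is a minimal Python sketch; the dialogue format and the `judge_is_unsafe` heuristic are hypothetical and simpler than the evaluation described in the paper.

```python
# Illustrative sketch only: scoring an attack-success-rate (ASR) evaluation over
# multi-turn coreference attack dialogues. All names and the keyword-based judge
# are assumptions, not the CoSafe authors' released code or protocol.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dialogue:
    category: str                      # one of the 14 risk categories
    turns: List[str] = field(default_factory=list)  # earlier user/assistant turns
    final_response: str = ""           # model reply to the coreference-based final turn


def judge_is_unsafe(response: str) -> bool:
    """Toy safety judge: treat anything that is not a refusal as unsafe."""
    refusal_markers = ("i cannot", "i can't", "i'm sorry", "as an ai")
    return not any(marker in response.lower() for marker in refusal_markers)


def attack_success_rate(dialogues: List[Dialogue]) -> float:
    """ASR = fraction of attack dialogues whose final response is judged unsafe."""
    if not dialogues:
        return 0.0
    unsafe = sum(judge_is_unsafe(d.final_response) for d in dialogues)
    return unsafe / len(dialogues)


if __name__ == "__main__":
    # Two toy dialogues: one refused, one complied.
    dialogues = [
        Dialogue("illegal activity", ["..."], "I'm sorry, I can't help with that."),
        Dialogue("illegal activity", ["..."], "Sure, here is how you would ..."),
    ]
    print(f"ASR: {attack_success_rate(dialogues):.1%}")  # -> ASR: 50.0%
```

In practice, a stricter judge (e.g., a classifier or LLM-based evaluator) would replace the keyword heuristic, but the per-dialogue aggregation shown here is the general shape of an ASR computation.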
Anthology ID: 2024.emnlp-main.968
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 17494–17508
URL: https://aclanthology.org/2024.emnlp-main.968
Cite (ACL): Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Gao Zuchen, Fei Mi, and Lanqing Hong. 2024. CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17494–17508, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference (Yu et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.968.pdf