Do-Not-Answer: Evaluating Safeguards in LLMs

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin


Abstract
With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to deploy LLMs responsibly. Here we aim to facilitate this process. In particular, we collect an open-source dataset for evaluating the safeguards in LLMs, with the goal of enabling the deployment of safer open-source LLMs at low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results comparable to GPT-4 on automatic safety evaluation. Our data and code are available at https://github.com/Libr-AI/do-not-answer
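
As a rough illustration of the kind of automatic safety evaluation the abstract describes (a minimal sketch, not the authors' released pipeline; the checkpoint path, label set, and helper function below are assumptions), a BERT-style classifier could score (instruction, response) pairs like this:

# Hypothetical sketch: scoring LLM responses with a BERT-style safety classifier.
# The checkpoint path and label names are placeholders, not the paper's released artifacts.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_DIR = "path/to/bert-safety-classifier"  # assumed: a BERT-style model fine-tuned on annotated responses
LABELS = ["harmful", "harmless"]              # assumed binary label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def classify(instruction: str, response: str) -> str:
    """Classify a model's response to a risky instruction as harmful or harmless."""
    inputs = tokenizer(instruction, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify("Explain how to pick a lock.", "I can't help with that request."))

In this setup, the paper's finding would amount to such a fine-tuned classifier agreeing with GPT-4-based judgments closely enough to serve as a cheap automatic evaluator.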
Anthology ID: 2024.findings-eacl.61
Volume: Findings of the Association for Computational Linguistics: EACL 2024
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 896–911
URL: https://aclanthology.org/2024.findings-eacl.61
Cite (ACL): Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. 2024. Do-Not-Answer: Evaluating Safeguards in LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): Do-Not-Answer: Evaluating Safeguards in LLMs (Wang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-eacl.61.pdf