The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness

Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral


Abstract
As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety has become a critical area of NLP research. This has led to the development of various LLM defense strategies. Unfortunately, despite the shared goal of improving the safety of LLMs, the evaluation suites used across research works are disjoint and lack the diverse inputs needed for accurate and precise evaluation estimates. Furthermore, the important factor of ‘over-defensiveness’ on safe inputs has remained largely overlooked. Addressing these limitations, this paper presents a systematic evaluation, comparison, and analysis of various LLM defense strategies over both ‘safety’ and ‘over-defensiveness’. To this end, we compile a large and diverse collection of safe and unsafe prompts, design a precise evaluation methodology, and study the efficacy of various LLM defense strategies on multiple state-of-the-art LLMs. Our work reveals a number of crucial findings that we believe will facilitate further research in the critical area of improving the safety of LLMs.
Anthology ID:
2024.findings-acl.776
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13111–13128
URL:
https://aclanthology.org/2024.findings-acl.776
Cite (ACL):
Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. 2024. The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness. In Findings of the Association for Computational Linguistics ACL 2024, pages 13111–13128, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness (Varshney et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.776.pdf