Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack

Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong


Abstract
Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety considerations? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with weaker safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini and GPT-4, indicating an urgent need for strengthening safety alignment across a broad spectrum of NLP tasks.
Anthology ID:
2024.acl-long.461
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8483–8502
URL:
https://aclanthology.org/2024.acl-long.461
Cite (ACL):
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, and Yue Dong. 2024. Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8483–8502, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack (Fu et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.461.pdf