Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models

Weidi Luo; He Cao; Zijing Liu; Yu Wang; Aidan Wong; Bin Feng; Yuan Yao; Yu Li

doi:10.18653/v1/2025.findings-naacl.368

Abstract

With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain-specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries; and (ii) over-defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi-agent defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that G4D can enhance LLMs’ robustness against jailbreak attacks in general and domain-specific scenarios without compromising the model’s general functionality.

Anthology ID: 2025.findings-naacl.368
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6614–6635
URL: https://aclanthology.org/2025.findings-naacl.368/
DOI: 10.18653/v1/2025.findings-naacl.368
Cite (ACL): Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bin Feng, Yuan Yao, and Yu Li. 2025. Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6614–6635, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models (Luo et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.368.pdf