Dynamic Evil Score-Guided Decoding: An Efficient Decoding Framework For Red-Team Model

Cong Gao; Bo Zhang (波章); Linkang Yang; Minghao Hu; Zhunchen Luo; Xiaoying Bai; Guotong Geng; Jun Zhang; Yunhua Xue

doi:10.18653/v1/2025.findings-acl.564

Abstract

Large language models (LLMs) have advanced significantly but can still generate harmful content such as social biases, extremism, and misinformation. Red teaming is a promising approach to improving model safety: it creates adversarial prompts to test and improve model robustness. However, existing red-teaming methods often require expensive fine-tuning, especially for large LLMs. We propose the Dynamic Evil Score-Guided Decoding framework (DESGD), an efficient red-teaming method whose computational cost does not grow with the size of the target model. DESGD introduces the concept of an ‘evil score’ that dynamically evaluates, during decoding, how much each token could contribute to a harmful output. The framework constructs a small unsafe model from an adversarial dataset and adjusts the logits vector of the target model based on the evil score. Experiments show that DESGD achieves an attack success rate (ASR) of 92.83% on the Llama-3.2-3B-Instruct model, compared to 83.48% with adversarial fine-tuning, while using fewer computational resources. Similarly, on the Qwen2.5-3B-Instruct model, DESGD reaches an ASR of 88.62%, outperforming adversarial fine-tuning (77.56%).
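The abstract does not state how the evil score is computed or how it enters the logits, so the following is only a minimal sketch of the general idea of score-guided logit adjustment, under stated assumptions. It assumes, hypothetically, that a per-token score is the log-probability gap between a small guide model and a small reference model, and that this score is added to the target model's logits with a fixed weight `alpha`. The function names, the reference model, the additive rule, and the fixed weight are all illustrative assumptions, not the paper's formulation; in particular, the "dynamic" part of DESGD is not modelled here. The logits are toy numbers over a three-token vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_logits(target_logits, guide_logits, reference_logits, alpha=1.0):
    """Shift target logits by a per-token score (hypothetical rule).

    Assumption: score = log p_guide(token) - log p_reference(token).
    The abstract only says the target logits are adjusted "based on the
    evil score"; this additive form is an illustration, not the paper's.
    """
    if not (len(target_logits) == len(guide_logits) == len(reference_logits)):
        raise ValueError("all logit vectors must share one vocabulary size")
    guide_lp = log_softmax(guide_logits)
    ref_lp = log_softmax(reference_logits)
    score = [g - r for g, r in zip(guide_lp, ref_lp)]
    return [t + alpha * s for t, s in zip(target_logits, score)]


def greedy_token(logits):
    """Index of the highest logit (greedy decoding for one step)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy single decoding step over a 3-token vocabulary.
target = [2.0, 1.0, 0.0]      # large target model prefers token 0
guide = [0.0, 3.0, 0.0]       # small guide model prefers token 1
reference = [0.0, 0.0, 0.0]   # small reference model is indifferent

unguided_choice = greedy_token(guided_logits(target, guide, reference, alpha=0.0))
guided_choice = greedy_token(guided_logits(target, guide, reference, alpha=1.0))
```

In this sketch only the small models would ever be trained, and the target model contributes a single forward pass per step, which is one way to read the abstract's claim that cost does not grow with target model size. Setting `alpha=0.0` recovers ordinary decoding from the target model.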

Anthology ID: 2025.findings-acl.564
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10817–10833
URL: https://aclanthology.org/2025.findings-acl.564/
DOI: 10.18653/v1/2025.findings-acl.564
Cite (ACL): Cong Gao, Bo Zhang, Linkang Yang, Minghao Hu, Zhunchen Luo, Xiaoying Bai, Guotong Geng, Jun Zhang, and Yunhua Xue. 2025. Dynamic Evil Score-Guided Decoding: An Efficient Decoding Framework For Red-Team Model. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10817–10833, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Dynamic Evil Score-Guided Decoding: An Efficient Decoding Framework For Red-Team Model (Gao et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.564.pdf