@inproceedings{xuan-etal-2025-shieldhead,
title = "{S}hield{H}ead: Decoding-time Safeguard for Large Language Models",
author = "Xuan, Zitao and
Mao, Xiaofeng and
Chen, Da and
Zhang, Xin and
Dong, Yuhan and
Zhou, Jun",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.932/",
doi = "10.18653/v1/2025.findings-acl.932",
pages = "18129--18143",
ISBN = "979-8-89176-256-5",
abstract = "In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent advancements in LLM-based moderation methods, e.g., LlamaGuard, have demonstrated remarkable promise in identifying safety risks associated with both inputs and outputs in human-AI interactions. However, integrating LLM-based safeguards into a chatbot system requires an additional inference stage involving a moderation LLM with billions of parameters, which significantly increases computational costs and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful contents. The classification head, referred to as ShieldHead, serves as an auxiliary branch paralleled with next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead exhibits remarkable efficiency during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system{'}s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: a state-of-the-art performance on the XSTest and SafeRLHF datasets while running at a speed about **300{\texttimes}** faster (**{\ensuremath{<}}1ms**) than previous LLM-based moderation models with ** 99{\%}** less parameters of LlamaGuard."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="xuan-etal-2025-shieldhead">
<titleInfo>
<title>ShieldHead: Decoding-time Safeguard for Large Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zitao</namePart>
<namePart type="family">Xuan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiaofeng</namePart>
<namePart type="family">Mao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Da</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xin</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuhan</namePart>
<namePart type="family">Dong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jun</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent advancements in LLM-based moderation methods, e.g., LlamaGuard, have demonstrated remarkable promise in identifying safety risks associated with both inputs and outputs in human-AI interactions. However, integrating LLM-based safeguards into a chatbot system requires an additional inference stage involving a moderation LLM with billions of parameters, which significantly increases computational costs and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful content. The classification head, referred to as ShieldHead, serves as an auxiliary branch parallel to the next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead exhibits remarkable efficiency during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system’s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: state-of-the-art performance on the XSTest and SafeRLHF datasets while running about 300× faster (&lt;1ms) than previous LLM-based moderation models with 99% fewer parameters than LlamaGuard.</abstract>
<identifier type="citekey">xuan-etal-2025-shieldhead</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.932</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.932/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>18129</start>
<end>18143</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T ShieldHead: Decoding-time Safeguard for Large Language Models
%A Xuan, Zitao
%A Mao, Xiaofeng
%A Chen, Da
%A Zhang, Xin
%A Dong, Yuhan
%A Zhou, Jun
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F xuan-etal-2025-shieldhead
%X In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent advancements in LLM-based moderation methods, e.g., LlamaGuard, have demonstrated remarkable promise in identifying safety risks associated with both inputs and outputs in human-AI interactions. However, integrating LLM-based safeguards into a chatbot system requires an additional inference stage involving a moderation LLM with billions of parameters, which significantly increases computational costs and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful content. The classification head, referred to as ShieldHead, serves as an auxiliary branch parallel to the next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead exhibits remarkable efficiency during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system’s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: state-of-the-art performance on the XSTest and SafeRLHF datasets while running about 300× faster (<1ms) than previous LLM-based moderation models with 99% fewer parameters than LlamaGuard.
%R 10.18653/v1/2025.findings-acl.932
%U https://aclanthology.org/2025.findings-acl.932/
%U https://doi.org/10.18653/v1/2025.findings-acl.932
%P 18129-18143
Markdown (Informal)
[ShieldHead: Decoding-time Safeguard for Large Language Models](https://aclanthology.org/2025.findings-acl.932/) (Xuan et al., Findings 2025)
ACL
Zitao Xuan, Xiaofeng Mao, Da Chen, Xin Zhang, Yuhan Dong, and Jun Zhou. 2025. ShieldHead: Decoding-time Safeguard for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18129–18143, Vienna, Austria. Association for Computational Linguistics.
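
As a rough illustration of the decoding-time idea described in the abstract (a lightweight classification head over the dialogue model's last-layer hidden states, running in parallel with the LM head), here is a minimal PyTorch sketch. The class name, hidden size, and two-way safe/unsafe labeling are assumptions for demonstration only; the paper's ShieldHead additionally uses token- and sentence-level label disambiguation during training, which is not shown here.

```python
# Illustrative sketch (not the paper's released code): a small linear head that
# scores each decoding step for safety risk from the LLM's last-layer hidden
# states, alongside the usual next-token LM head. Names and sizes are assumed.
import torch
import torch.nn as nn


class SafetyHead(nn.Module):
    """Linear probe over last-layer hidden states, applied token by token."""

    def __init__(self, hidden_size: int, num_risk_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_risk_classes)

    def forward(self, last_hidden_states: torch.Tensor) -> torch.Tensor:
        # last_hidden_states: (batch, seq_len, hidden_size) from the dialogue model.
        # Returns per-token risk logits of shape (batch, seq_len, num_risk_classes),
        # so a moderation score for the text generated so far is available at every step.
        return self.classifier(last_hidden_states)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for a real LLM's hidden states.
    batch, seq_len, hidden_size = 1, 8, 4096  # hidden_size is an assumption
    head = SafetyHead(hidden_size)
    hidden = torch.randn(batch, seq_len, hidden_size)
    per_token_risk = head(hidden).softmax(dim=-1)  # per-token class probabilities
    print(per_token_risk.shape)  # torch.Size([1, 8, 2])
```

Because the head reuses hidden states the dialogue model already computes, the extra cost per token is a single small matrix multiplication, which is consistent with the sub-millisecond, streaming moderation the abstract reports.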