LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

Hayder Elesedy, Pedro Esperanca, Silviu Vlad Oprea, Mete Ozay


Abstract
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained portable devices, such as mobile phones, which increasingly run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLM and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100–1000× lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
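To make the dual-path idea concrete, here is a minimal numpy sketch (not the authors' implementation; all names, dimensions, and the single-layer setup are illustrative assumptions). The generative path uses only the frozen base weight `W`; the guard path adds a low-rank update `B @ A` and feeds the result to a small classification head. With the standard LoRA initialisation `B = 0`, the two paths start out identical, so attaching the adapter cannot perturb generation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_classes = 8, 2, 2  # hidden size, adapter rank, safe/harmful labels

# Frozen base projection (stands in for a transformer weight matrix).
W = rng.standard_normal((d, d))

# Trainable low-rank adapter: delta_W = B @ A, only 2*d*r parameters
# instead of the d*d a full fine-tune would update.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))  # LoRA-style init: B = 0, so the adapter starts as a no-op

# Trainable guard classification head (illustrative).
head = rng.standard_normal((num_classes, d)) * 0.01

x = rng.standard_normal(d)  # a stand-in feature vector for one token

# Generative path: base weights only; the adapter is bypassed entirely.
gen_features = W @ x

# Guard path: base weights plus the low-rank update, then the guard head.
guard_features = W @ x + B @ (A @ x)
logits = head @ guard_features

# At initialisation the two paths coincide, so generation is untouched.
assert np.allclose(gen_features, guard_features)
```

During guard training only `A`, `B`, and `head` would receive gradients, which is where the 100–1000× parameter-overhead reduction reported in the abstract comes from: the shared LLM backbone stays frozen and is reused by both paths.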
Anthology ID:
2024.emnlp-main.656
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11746–11765
URL:
https://aclanthology.org/2024.emnlp-main.656
Cite (ACL):
Hayder Elesedy, Pedro Esperanca, Silviu Vlad Oprea, and Mete Ozay. 2024. LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11746–11765, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models (Elesedy et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.656.pdf