Hyunwoo Lee
2024
SLM as Guardian: Pioneering AI Safety with Small Language Model
Ohjoon Kwon | Donghyeon Jeon | Nayoung Choi | Gyu-Hwung Cho | Hwiyeol Jo | Changbong Kim | Hyunwoo Lee | Inho Kang | Sun Kim | Taiwoo Park
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Most prior safety research on large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of their use cases. However, internalizing such safeguard features into larger models brings challenges of higher training cost and unintended degradation of helpfulness. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism that fuses the two tasks into a single model. We demonstrate the effectiveness of our approach, which achieves harmful query detection and safeguard response performance on par with or surpassing that of publicly available LLMs.