SAFETY-J: Evaluating Safety with Critique

Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu


Abstract
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-Jemploys an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we have released SAFETY-J’s training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.
Anthology ID:
2024.findings-emnlp.64
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1170–1192
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.64
DOI:
Bibkey:
Cite (ACL):
Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, and Pengfei Liu. 2024. SAFETY-J: Evaluating Safety with Critique. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1170–1192, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
SAFETY-J: Evaluating Safety with Critique (Liu et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.64.pdf