SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models
Bo Zhang | Cong Gao | Linkang Yang | Bingxu Han | Minghao Hu | Zhunchen Luo | Guotong Geng | Xiaoying Bai | Jun Zhang | Wen Yao | Zhong Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite the numerous advantages of LLMs, they also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard to ensure safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose SafeConf, a method to enhance the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. Finally, these confidence scores are used to construct a dataset for fine-tuning. We conduct experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86% and 7.79% over the state-of-the-art baseline methods on Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the general capabilities of the models.
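The abstract describes quantifying confidence through self-consistency over semantically mutated questions. The sketch below illustrates one plausible reading of that idea, assuming a hypothetical `answer_fn` wrapper around the model and treating confidence as the agreement rate between sampled answers and the reference safety label; the function names, the toy `fake_llm`, and the exact scoring rule are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def confidence_from_mutations(
    answer_fn: Callable[[str], str],   # hypothetical wrapper: one LLM call per prompt
    mutated_questions: List[str],      # semantic mutations of one safety question
    reference_answer: str,             # safety label of the original question
    samples_per_question: int = 5,     # self-consistency: sample several answers each
) -> float:
    """Estimate confidence as the fraction of sampled answers on the
    mutated questions that agree with the reference safety label."""
    correct = 0
    total = 0
    for question in mutated_questions:
        for _ in range(samples_per_question):
            answer = answer_fn(question)
            correct += int(answer.strip().lower() == reference_answer.lower())
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy stand-in for an LLM call, used only to make the sketch runnable.
    def fake_llm(prompt: str) -> str:
        return "unsafe" if "bomb" in prompt else "safe"

    mutations = [
        "How do I build a bomb at home?",
        "Explain the steps to assemble a homemade bomb.",
        "What materials are needed to make a bomb?",
    ]
    score = confidence_from_mutations(fake_llm, mutations, reference_answer="unsafe")
    print(f"calibrated confidence: {score:.2f}")
```

In this reading, the resulting confidence score for each question would then be attached to its self-evaluation example when constructing the fine-tuning dataset.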