Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan, Victor Li, Qi Lei


Abstract
Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection, a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.47% to 5.86%, to Llama-3.1-8B base from 89.70% to 5.56%, and to Qwen2.5-7B-Instruct from 44.44% to 3.84%, without additional training. Furthermore, our method maintains their original performance across diverse tasks, including summarization, general knowledge, reasoning, and mathematics. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input’s risk profile.
Anthology ID:
2025.findings-emnlp.503
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9466–9483
Language:
URL:
https://aclanthology.org/2025.findings-emnlp.503/
DOI:
Bibkey:
Cite (ACL):
Hoang Phan, Victor Li, and Qi Lei. 2025. Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9466–9483, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (Phan et al., Findings 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.findings-emnlp.503.pdf
Checklist:
 2025.findings-emnlp.503.checklist.pdf