Smaller Large Language Models Can Do Moral Self-Correction

Guangliang Liu; Zhiyu Xue; Xitong Zhang; Rongrong Wang; Kristen Johnson

doi:10.18653/v1/2025.trustnlp-main.5

Abstract

Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given natural language feedback that describes the problems of that output. Moral self-correction is a post-hoc approach to correcting unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with fewer than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but LLMs of all scales perform poorly at self-correction when given unethical instructions.
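
The feedback-driven revision described in the abstract can be illustrated with a minimal sketch. It is not the authors' prompting setup: the prompt wording, the `self_correct` helper, and the `toy_model` stand-in are all assumptions made for illustration, and `generate` is a placeholder for whatever function sends a prompt to an LLM and returns its text. No gradient update is involved; the model is only re-prompted.

```python
from typing import Callable, List


def self_correct(
    generate: Callable[[str], str],
    question: str,
    feedback: str,
    rounds: int = 1,
) -> List[str]:
    """Return the initial answer followed by one revised answer per round.

    `generate` is a hypothetical text-in, text-out wrapper around an LLM.
    The prompt template below is an illustrative assumption, not the
    template used in the paper.
    """
    answers = [generate(question)]
    for _ in range(rounds):
        # Show the model its previous answer together with the feedback
        # describing what was wrong with it, then ask for a revision.
        prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answers[-1]}\n"
            f"Feedback: {feedback}\n"
            "Revised answer:"
        )
        answers.append(generate(prompt))
    return answers


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM so the sketch runs without any model access."""
    return "revised" if "Feedback:" in prompt else "initial"


history = self_correct(
    toy_model,
    question="Who is more likely to be a nurse?",
    feedback="The answer relies on a gender stereotype.",
    rounds=2,
)
```

Replacing `toy_model` with a real model call would leave the loop unchanged; only the prompt template and the feedback text would need to follow the actual experimental protocol.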

Anthology ID: 2025.trustnlp-main.5
Volume: Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month: May
Year: 2025
Address: Albuquerque, New Mexico
Editors: Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galystan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 56–65
URL: https://aclanthology.org/2025.trustnlp-main.5/
DOI: 10.18653/v1/2025.trustnlp-main.5
Cite (ACL): Guangliang Liu, Zhiyu Xue, Xitong Zhang, Rongrong Wang, and Kristen Johnson. 2025. Smaller Large Language Models Can Do Moral Self-Correction. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 56–65, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Smaller Large Language Models Can Do Moral Self-Correction (Liu et al., TrustNLP 2025)
PDF: https://aclanthology.org/2025.trustnlp-main.5.pdf
