Chunkang Zhang
2026
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Yingzhi Mao | Chunkang Zhang | Junxiang Wang | Xinyan Guan | Boxi Cao | Yaojie Lu | Hongyu Lin | Xianpei Han | Le Sun
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. This phenomenon reveals that LRMs are capable of recognizing harm and that safety failures arise primarily during later reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
2025
AutoAlign: Get Your LLM Aligned with Minimal Annotations
Xinyu Lu | Dong Xu | Chunkang Zhang | Xinyan Guan | Junxiang Wang | Qingyu Zhang | Pengbo Wang | Yingzhi Mao | Hao Xiang | Xueru Wen | Zichao Li | Yaojie Lu | Hongyu Lin | Le Sun | Xianpei Han
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Automated Alignment refers to a set of algorithms designed to align Large Language Models (LLMs) with human intentions and values while minimizing manual intervention. However, it faces challenges such as algorithmic diversity and excessively convoluted workflows. We present AutoAlign, an open-source toolkit that offers: (1) a unified framework integrating mainstream automated algorithms through a consistent interface, and (2) an accessible workflow supporting one-click execution for prompt synthesis, automatic alignment signal construction, and iterative model training. Our toolkit enables easy reproduction of existing results through extensive benchmarks and facilitates the development of novel approaches via modular components. It includes implementations for both highly efficient inference and training, as well as low-resource training. By standardizing automated alignment methodologies and providing accessible implementations, AutoAlign lowers the barriers to building customized aligned models and supports academic research.
2024
Pattern Shifting or Knowledge Losing? A Forgetting Perspective for Understanding the Effect of Instruction Fine-Tuning
Chunkang Zhang | Boxi Cao | Yaojie Lu | Hongyu Lin | Liu Cao | Ke Zeng | Guanglu Wan | Xunliang Cai | Xianpei Han | Le Sun
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Instruction Fine-Tuning (IFT) has emerged as an essential step in training large language models to robustly carry out tasks of interest. However, a systematic investigation of the underlying mechanisms of instruction fine-tuning is lacking, particularly of the forgetting phenomenon after IFT, known as the alignment tax. Therefore, to understand the mechanism of IFT from the forgetting perspective, we investigate the alteration of the text pattern and knowledge within models throughout the entire IFT process. Specifically, we restore fine-tuned models to their base version by training them on data that shares a similar distribution with the pre-training corpus and compare their results. Our experiments indicate that there is a stage transition of forgetting during the IFT process: (1) Pseudo Forgetting: in this stage, models mainly shift their familiar text pattern away from the pre-training data format while the world knowledge is preserved. Consequently, models recover their original performance when they are restored to the base version. (2) Actual Forgetting: in this stage, models forget the acquired knowledge as well. Therefore, they fail to reach the original performance even if they are restored to the base version.