Qimin Zhong
2026
Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement
Haiming Qin | Jianxun Lian | Qimin Zhong | Mingyang Zhou | Hao Liao | Naipeng Chao
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly deployed in role-play scenarios, but their safety implications remain under-characterized. We present an explanatory framework grounded in Bandura’s Moral Disengagement theory and introduce a diagnostic benchmark (MD-Trace) for role-play jailbreaks. In our experiments, role-play improves safety behavior for benign personas while increasing unsafe compliance for malicious ones. We observe a Knowing-but-Doing failure in which models recognize safety risks in their thinking traces yet proceed to comply with harmful requests. Mechanism analysis suggests that Moral Justification is dominant, with Disregard of Consequences appearing as a secondary pattern. We compare multiple attack and defense methods and find that the diagnosis aligns with observed failure modes. Finally, we propose MD-Shield, an introspection-based defense that reduces attack success while maintaining Role Fidelity. The source code is publicly available at https://github.com/lavapapa/MoralJustify/.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
Qimin Zhong | Hao Liao | Haiming Qin | Mingyang Zhou | Rui Mao | Wei Chen | Naipeng Chao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we analyze the gradient inductive bias of MTP theoretically and support the analysis with empirical evidence, showing that MTP promotes convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method, Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and the real-world Manhattan Taxi Ride dataset show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.