Naipeng Chao

2026

Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs’ role-playing capabilities, advancing the development of more authentic and socially adept AI agents. Our code and extended appendix are available at https://anonymous.4open.science/r/PersonaArena-B323/.

Large Language Models (LLMs) are increasingly deployed in role-play scenarios, but their safety implications remain under-characterized. We present an explanatory framework grounded in Bandura’s Moral Disengagement theory and introduce a diagnostic benchmark (MD-Trace) for role-play jailbreaks. In our experiments, role-play improves safety behavior for benign personas while increasing unsafe compliance for malicious ones. We observe a Knowing-but-Doing failure in which models recognize safety risks in their thinking traces yet proceed to comply with harmful requests. Mechanism analysis suggests that Moral Justification is dominant, with Disregard of Consequences appearing as a secondary pattern. We compare multiple attack and defense methods and find that the diagnosis aligns with observed failure modes. Finally, we propose MD-Shield, an introspection-based defense that reduces attack success while maintaining Role Fidelity. The source code is publicly available at https://github.com/lavapapa/MoralJustify/.

Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical analysis of the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method, **Latent Semantic Enhancement MTP (LSE-MTP)**, which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride data show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.

Co-authors

Rui Mao 1

Venues

Findings 2
ACL 1
