@inproceedings{du-etal-2026-human,
title = "{HER}: Human-like Reasoning and Reinforcement Learning for {LLM} Role-playing",
author = "Du, Chengyu and
Wang, Xintao and
Chen, Aili and
Li, Weiyuan and
Xu, Rui and
Liu, Junteng and
Huang, Zishan and
Tian, Rong and
Sun, Zijun and
Li, Yuhao and
Feng, Liheng and
Ding, Deming and
Zhao, Pengyu and
Xiao, Yanghua",
editor = "Liakata, Maria and
Moreira, Viviane P. and
Zhang, Jiajun and
Jurgens, David",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {ACL} 2026",
month = jul,
year = "2026",
address = "San Diego, California, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.findings-acl.1283/",
pages = "25725--25762",
ISBN = "979-8-89176-395-1",
abstract = "LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a non-trivial challenge. Towards cognitive simulation in LLM role-play, previous efforts have mainly suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters' first-person thinking processes from LLMs' third-person reasoning. To bridge the aforementioned gaps, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models based on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26{\%} on the CoSER benchmark and a 14.97{\%} on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="du-etal-2026-human">
<titleInfo>
<title>HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Chengyu</namePart>
<namePart type="family">Du</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xintao</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Aili</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weiyuan</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rui</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junteng</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zishan</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rong</namePart>
<namePart type="family">Tian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zijun</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuhao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Liheng</namePart>
<namePart type="family">Feng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Deming</namePart>
<namePart type="family">Ding</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pengyu</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yanghua</namePart>
<namePart type="family">Xiao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2026</title>
</titleInfo>
<name type="personal">
<namePart type="given">Maria</namePart>
<namePart type="family">Liakata</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Viviane</namePart>
<namePart type="given">P</namePart>
<namePart type="family">Moreira</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiajun</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Jurgens</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">San Diego, California, United States</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-395-1</identifier>
</relatedItem>
<abstract>LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a non-trivial challenge. Towards cognitive simulation in LLM role-play, previous efforts have mainly suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters’ first-person thinking processes from LLMs’ third-person reasoning. To bridge the aforementioned gaps, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models based on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26% on the CoSER benchmark and a 14.97% on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing.</abstract>
<identifier type="citekey">du-etal-2026-human</identifier>
<location>
<url>https://aclanthology.org/2026.findings-acl.1283/</url>
</location>
<part>
<date>2026-07</date>
<extent unit="page">
<start>25725</start>
<end>25762</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
%A Du, Chengyu
%A Wang, Xintao
%A Chen, Aili
%A Li, Weiyuan
%A Xu, Rui
%A Liu, Junteng
%A Huang, Zishan
%A Tian, Rong
%A Sun, Zijun
%A Li, Yuhao
%A Feng, Liheng
%A Ding, Deming
%A Zhao, Pengyu
%A Xiao, Yanghua
%Y Liakata, Maria
%Y Moreira, Viviane P.
%Y Zhang, Jiajun
%Y Jurgens, David
%S Findings of the Association for Computational Linguistics: ACL 2026
%D 2026
%8 July
%I Association for Computational Linguistics
%C San Diego, California, United States
%@ 979-8-89176-395-1
%F du-etal-2026-human
%X LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a non-trivial challenge. Towards cognitive simulation in LLM role-play, previous efforts have mainly suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters’ first-person thinking processes from LLMs’ third-person reasoning. To bridge the aforementioned gaps, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models based on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26% on the CoSER benchmark and a 14.97% on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing.
%U https://aclanthology.org/2026.findings-acl.1283/
%P 25725-25762
Markdown (Informal)
[HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing](https://aclanthology.org/2026.findings-acl.1283/) (Du et al., Findings 2026)
ACL
- Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, and Yanghua Xiao. 2026. HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25725–25762, San Diego, California, United States. Association for Computational Linguistics.