Wei Zhang


2026

Large Language Models (LLMs) have demonstrated remarkable capabilities in open-domain dialogues. However, their performance in service dialogues remains suboptimal, as these require agents to guide users toward specific business objectives while dynamically tracking states and adapting strategies. This gap stems from the scarcity of high-quality training data and the difficulty of simulating authentic, goal-oriented user behaviors. We propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage the training curriculum, and a User Simulator that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary.


MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
Yizhe Zeng | Wei Zhang | Yunpeng Li | Juxin Xiao | Xiao Wang | Yuling Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor (MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model’s post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.

Venues

ACL: 1
Findings: 1
