Wenxuan Zhang


2026

Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction. The core code and dataset are publicly available at https://github.com/sheunghung/EAR.


Multi-agent systems based on large language models (LLMs) have recently shown strong potential for machine translation (MT). However, their application to multi-domain translation (MDT) remains under-explored, particularly in addressing cross-domain word ambiguity. To investigate whether multi-agent approaches can help with disambiguation in MDT, we propose MACD, a multi-agent collaborative disambiguation framework for MDT that leverages the collaborative capabilities of LLMs. MACD consists of four cooperating agents responsible for domain allocation, general translation, domain disambiguation, and translation fusion. Experimental results show that MACD significantly improves translation performance across multiple domains and enhances disambiguation accuracy. Our analysis also yields several findings on how multi-agent collaboration resolves word ambiguities.


Do LLMs Really Know What They Don’t Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
Chi Seng Cheang | Hou Pong Chan | Wenxuan Zhang | Yang Deng
Findings of the Association for Computational Linguistics: ACL 2026

Recent work suggests that LLMs "know what they don’t know", positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model’s parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.


Existing psychological counseling datasets often suffer from monolithic client personas, insufficient therapeutic depth, and a lack of process controllability. To address these limitations, we propose PsyChain, a chain-of-agents framework that evolves static counseling corpora into high-fidelity dialogues through collaborative simulation that explicitly models client personality, stage progression, safety monitoring, and expert supervision. PsyChain involves a Client Profiler that extracts life scenarios and pairs them with psychological personality archetypes to synthesize diverse profiles. To simulate the complete counseling process, five specialized agents (Process Monitor, Client Speaker, Safety Monitor, Counselor Supervisor, and Counselor Speaker) collaborate and interact autonomously at each dialogue turn to ensure therapeutic professionalism and safety. We apply this framework to construct PsyChainD, a Chinese dataset of 10,456 dialogues featuring systematically diverse client profiles. Extensive evaluation of client-side, counselor-side, and overall quality shows substantial improvements. The model trained on PsyChainD achieves 61-91% win rates against domain-specific baselines in pairwise evaluation and the highest average score in human evaluation, indicating potential for real-world counseling.


Language of Thought Shapes Output Diversity in Large Language Models
Shaoyang Xu | Wenxuan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking (the *language of thought*) provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model’s thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking (*Single-Language Sampling* and *Mixed-Language Sampling*) and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear positive correlation: languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model’s diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP-Lab/Multilingual-LoT-Diversity.


DR-Arena: an Automated Evaluation Framework for Deep Research Agents
Yiwen Gao | Ruochen Zhao | Yang Deng | Wenxuan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.

Venues

Findings: 4
ACL: 2
