Wen Zhang

2026

Harmful memes convey offensive intent through implicit associations between visual symbols and text, requiring a broad understanding of cultural stereotypes and visual metaphors. Small-scale Multimodal Large Language Models (MLLMs) often lack the knowledge required to identify such implicit hate, whereas large-scale MLLMs, despite their broader knowledge, exhibit systematic labeling bias. To address these challenges, we propose DR-HM, a Distill-then-Reinforce training framework with cognition-aware data synthesis for harmful meme detection, which aims to transfer knowledge from closed-source models while mitigating their biases. DR-HM introduces a six-step structured data synthesis scheme with self-refinement that decomposes meme analysis into a progressive, human-inspired reasoning process from entity recognition to harmfulness judgment. Based on the synthesized reasoning data, we further adopt a Distill-then-Reinforce training strategy. This approach combines two-stage Supervised Fine-Tuning (SFT) with an Adaptive Group Relative Policy Optimization (A-GRPO) algorithm, which incorporates class-ratio-aware reward weighting and dynamic KL coefficients. Experiments on three benchmark datasets show that the proposed approach consistently outperforms existing methods and achieves an accuracy of 84.7% on the FHM dataset, approaching the reported performance of human annotators.
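The abstract does not spell out how A-GRPO weights rewards, so the following is only a minimal sketch of one way class-ratio-aware weighting could be combined with the standard group-relative advantage used in GRPO. The inverse-frequency weights, the choice to scale the advantage rather than the raw reward, and all function names are assumptions for illustration, not the authors' implementation; the dynamic KL coefficient is not modeled.

```python
from statistics import mean, pstdev


def class_weights(label_counts):
    """Inverse-frequency class weights (an assumed scheme, not from the paper).

    Normalized so that the frequency-weighted mean weight equals 1.
    """
    total = sum(label_counts.values())
    num_classes = len(label_counts)
    return {label: total / (num_classes * count)
            for label, count in label_counts.items()}


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_advantages(rewards, gold_label, weights):
    """Scale the normalized advantages by the weight of the prompt's gold class.

    The weight is applied after normalization: scaling the raw rewards first
    would cancel out when dividing by the group standard deviation.
    """
    w = weights[gold_label]
    return [w * a for a in group_relative_advantages(rewards)]


if __name__ == "__main__":
    # Hypothetical label distribution: harmful memes are the minority class.
    weights = class_weights({"harmful": 300, "benign": 700})
    # Four sampled responses for one harmful meme; 1.0 marks a correct verdict.
    print(weighted_advantages([1.0, 0.0, 1.0, 1.0], "harmful", weights))
```

Under these assumptions, responses to minority-class memes contribute larger policy-gradient updates than responses to majority-class memes, which is one plausible way to counteract labeling bias.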

Large Reasoning Models (LRMs) often suffer from overthinking, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose NEAT, a Neuron-based Early reAsoning exiT framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing any additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that NEAT reduces token usage by 22% to 28% per model, averaged over the four benchmarks, while maintaining accuracy.

Existing multimodal emotion and intent recognition tasks predominantly focus on classification, overlooking the underlying rationale and intrinsic connections between these states. To bridge this gap, we propose **Joint Multimodal Emotion-Intent Explanation and Classification (JX4MEI)**, a novel task requiring the model to jointly predict emotion and intent while generating natural language explanations for why they co-occur. To support this, we present **XMEI-dataset**, a large-scale benchmark of 15,461 multimodal samples spanning 7 emotion and 9 intent categories across text, audio, and visual modalities. Unlike prior work, our dataset provides fine-grained rationales for emotion, intent, and their causal interplay, curated via a rigorous pipeline involving Chain-of-Thought generation and strict human refinement to eliminate model artifacts. Furthermore, we propose **XMEI-Qwen**, a model equipped with a novel **Language-Query Former (LQ-Former)**. By leveraging modality-specific captions as semantic queries, LQ-Former injects explicit semantic guidance into feature alignment, significantly enhancing reasoning capabilities. Empirical experiments demonstrate that XMEI-Qwen sets a new state of the art on the JX4MEI task, outperforming competitive baselines in both prediction and explanation generation. Code: https://github.com/OrangeYeah1027/JX4MEI.

ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation
Zhuoyue Gao | Xiaohui Wang | Xiaocui Yang | Wen Zhang | Daling Wang | Shi Feng | Yifei Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose ES4R, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different Large Language Model (LLM) backbones. Code: https://github.com/Bean0901/ES4R.

Cat-MoD: Accelerating Multimodal Alignment via Caption Token Guided Asymmetric Mixture-of-Depths
YiJie Huang | Xiaocui Yang | Shi Feng | Wen Zhang | Kaisong Song | Yifei Zhang | Daling Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Efficiently aligning visual features with Large Language Models (LLMs) remains a critical bottleneck in Multimodal LLMs. Existing query-based alignment modules (e.g., Q-Former) rely on randomly initialized queries, resulting in an inefficient cold-start exploration process. Furthermore, they enforce uniform cross-attention across all layers, leading to computational redundancy. Our empirical analysis reveals that query tokens initialized with language priors can rapidly capture global semantics, leading to early representation convergence after only a few layers. In this paper, we propose **Cat-MoD**, a **Ca**ption **t**oken Guided Asymmetric **M**ixture-**o**f-**D**epths framework. It incorporates a **Hybrid Query Construction** module where Guide Tokens initialized from coarse-grained linguistic priors rapidly anchor global semantic context, and randomly initialized Explorer Tokens remain active to capture fine-grained visual details. Exploiting this early convergence, we introduce an **Asymmetric Mixture-of-Depths** mechanism, where a similarity-aware router dynamically prunes redundant tokens from expensive cross-attention layers while preserving their context in self-attention. Experiments on multiple benchmarks demonstrate that Cat-MoD matches or surpasses baseline performance while reducing alignment FLOPs by approximately 37% during both training and inference, offering a highly efficient solution for multimodal alignment. Code: https://github.com/JasonOrange0726/Cat-MoD.
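The similarity-aware routing idea can be illustrated with a small sketch. The abstract does not define the router, so everything here is an assumption: the function names, the use of cosine similarity between a query token's states at consecutive layers as the convergence signal, and the fixed threshold are all hypothetical. The sketch only selects which query tokens would skip cross-attention; skipped tokens would still take part in self-attention, as the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def route_tokens(prev_states, curr_states, threshold=0.99):
    """Split query-token indices into (active, skipped) for the next cross-attention layer.

    A token is treated as converged, and therefore skipped, when its state
    barely changed between two consecutive layers. The threshold is a
    hypothetical hyperparameter chosen for illustration.
    """
    active, skipped = [], []
    for index, (prev, curr) in enumerate(zip(prev_states, curr_states)):
        if cosine(prev, curr) >= threshold:
            skipped.append(index)
        else:
            active.append(index)
    return active, skipped


if __name__ == "__main__":
    # Token 0 has nearly stopped changing; token 1 is still moving.
    previous = [[1.0, 0.0], [1.0, 0.0]]
    current = [[1.0, 0.01], [0.0, 1.0]]
    print(route_tokens(previous, current))
```

Because only the active tokens issue cross-attention queries against the visual features, the cost of that layer shrinks in proportion to the number of skipped tokens under this assumed scheme.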

Venues

Findings: 3
ACL: 2