Zhanpeng Chen

2025

Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Zhanpeng Chen | Mingxiao Li | Ziyang Chen | Nan Du | Xiaolong Li | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2025

Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions continues to inhibit the models’ comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights, allowing for multi-granularity perception of visual elements, and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://anonymous.4open.science/r/PyPE-34EE.
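
The periphery-to-center indexing described in this abstract can be illustrated with a small sketch. It is not taken from the paper's code: the function names and the choice to give every token in the same concentric ring the same index are assumptions made for illustration, and the actual PyPE scheme may differ in detail. A raster-scan baseline is included for contrast.

```python
def ring_position_ids(height, width):
    """Return a height x width grid where each cell holds its ring depth.

    Assumption for illustration: ring 0 is the outermost border of the
    visual token grid, and the index grows by one per ring toward the center.
    """
    return [
        [min(r, c, height - 1 - r, width - 1 - c) for c in range(width)]
        for r in range(height)
    ]


def raster_position_ids(height, width):
    """Baseline: conventional row-major (raster-scan) position indexes."""
    return [[r * width + c for c in range(width)] for r in range(height)]


if __name__ == "__main__":
    # Print the ring indexes for a small 4 x 6 grid of visual tokens.
    for row in ring_position_ids(4, 6):
        print(row)
```

Under this assumed scheme, the largest visual index is roughly half the shorter grid side rather than the full token count, so the index gap between any visual token and the instruction tokens that follow stays small.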

ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks
Zijing Zhang | Zhanpeng Chen | He Zhu | Ziyang Chen | Nan Du | Xiaolong Li
Findings of the Association for Computational Linguistics: ACL 2025

Tool learning enhances Large Language Models’ (LLMs) dynamic interaction with external tools, improving their ability to solve complex problems. However, current empirical methods, which primarily focus on learning tools in isolation, still struggle with accurate multi-tool selection due to issues such as confusing similar tools and neglecting dependencies. To address these challenges, we propose the Tool Experience Network (ToolExpNet), which integrates tools and trial-and-error experiences into a network characterized by semantic similarity and dependency relationships. ToolExpNet iteratively conducts simulated experiments using adaptive sampling to explore subtle differences and connections between tools, and summarizes these experiences to provide insightful guidance for LLM tool selection. Our experiments demonstrate that learning the relationships between tools helps achieve more comprehensive tool learning. Evaluations on multiple real-world API datasets show that ToolExpNet effectively addresses common challenges in multi-tool selection, significantly outperforming existing baselines across different foundation LLMs.

2024

Code-Switching Can be Better Aligners: Advancing Cross-Lingual SLU through Representation-Level and Prediction-Level Alignment
Zhihong Zhu | Xuxin Cheng | Zhanpeng Chen | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Zero-shot cross-lingual spoken language understanding (SLU) can promote the globalization application of dialog systems, which has attracted increasing attention. While current code-switching based cross-lingual SLU frameworks have shown promising results, they (i) predominantly utilize contrastive objectives to model hard alignment, which may disrupt the inherent structure within sentences of each language; and (ii) focus optimization objectives solely on the original sentences, neglecting the relation between original sentences and code-switched sentences, which may hinder contextualized embeddings from further alignment. In this paper, we propose a novel framework dubbed REPE (short for Representation-Level and Prediction-Level Alignment), which leverages both code-switched and original sentences to achieve multi-level alignment. Specifically, REPE introduces optimal transport to facilitate soft alignment between the representations of code-switched and original sentences, thereby preserving structural integrity as much as possible. Moreover, REPE adopts multi-view learning to enforce consistency regularization between the prediction of the two sentences, aligning them into a more refined language-invariant space. Based on this, we further incorporate a self-distillation layer to boost the robustness of REPE. Extensive experiments on two benchmarks across ten languages demonstrate the superiority of the proposed REPE framework.

Relevance Is a Guiding Light: Relevance-aware Adaptive Learning for End-to-end Task-oriented Dialogue System
Zhanpeng Chen | Zhihong Zhu | Wanshi Xu | Xianwei Zhuang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Retrieving accurate domain knowledge and providing helpful information are crucial in developing an effective end-to-end task-oriented dialogue system (E2ETOD). The field has witnessed numerous methods following the retrieve-then-generate paradigm and training their systems on one specific domain. However, existing approaches still suffer from the Distractive Attributes Problem (DAP): struggling to deal with false but similar knowledge (hard negative entities), which is even more intractable when countless pieces of knowledge from different domains are blended in a real-world scenario. To alleviate DAP, we propose the Relevance-aware Adaptive Learning (ReAL) method, a two-stage training framework that eliminates hard negatives step-by-step and aligns retrieval with generation. In the first stage, we introduce a top-k adaptive contrastive loss and utilize the divergence-driven feedback from the frozen generator to pre-train the retriever. In the second stage, we propose using the metric score distribution as an anchor to align retrieval with generation. Thorough experiments on three benchmark datasets demonstrate ReAL’s superiority over existing methods, with extensive analysis validating its strong capabilities of overcoming in- and cross-domain distractions.

What are the Generator Preferences for End-to-end Task-Oriented Dialog System?
Wanshi Xu | Xianwei Zhuang | Zhanpeng Chen | Zhihong Zhu | Xuxin Cheng | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Fully end-to-end task-oriented dialogue (EToD) systems have shown excellent performance, which requires the ability to retrieve entities accurately for generation. Existing methods improve the accuracy of entity retrieval and construct data flows between retrieval results and response generator, achieving promising results. However, most of them suffer from the following issues: (1) The entity is retrieved by directly interacting with the context at a coarse-grained level, so the similarity score may be disturbed by irrelevant attributes; (2) The generator pays equal attention to retrieved entities and the context and does not learn the generation preferences for the current turn. In this paper, we propose a framework called Regulating Preferences of Generator (RPG) based on retrieval results, which includes a generator preference extractor, an entity retriever, and a generator with the gate-controlled preference regulator. The generator preference extractor not only improves the entity retriever by filtering the interference of irrelevant attributes but also provides more focused guidance to the generator by performing inter-turn attribute prediction. Experiments and analyses on three standard benchmarks show that our framework outperforms existing methods and improves the quality of the dialogue.

Dual-oriented Disentangled Network with Counterfactual Intervention for Multimodal Intent Detection
Zhanpeng Chen | Zhihong Zhu | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multimodal intent detection is designed to leverage diverse modalities for a comprehensive understanding of user intentions in real-world scenarios, thus playing a critical role in modern task-oriented dialogue systems. Existing methods have made great progress in modal alignment and fusion; however, two vital limitations are neglected: (I) close entanglement of multimodal semantics with modal structures; (II) insufficient learning of the causal effects of semantic and modality-specific information on the final predictions under end-to-end training. To alleviate the above limitations, we introduce the Dual-oriented Disentangled Network with Counterfactual Intervention (DuoDN). DuoDN addresses key limitations in current systems by effectively disentangling and utilizing modality-specific and multimodal semantic information. The model consists of a Dual-oriented Disentangled Encoder that decouples semantics-oriented and modality-oriented representations, alongside a Counterfactual Intervention Module that applies causal inference to understand causal effects by injecting confounders. Experiments on three benchmark datasets demonstrate DuoDN’s superiority over existing methods, with extensive analysis validating its advantages.

Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory
Xianwei Zhuang | Zhihong Zhu | Zhanpeng Chen | Yuxin Xie | Liming Liang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which hinders their application in multimodal understanding and decision-making. In this work, we introduce a novel plug-and-play training-free decoding algorithm named Game and Tree based Hallucination Mitigation (GTHM), designed for mitigating VH. GTHM is inspired by empirical observations that the fuzziness of multi-granularity view perception exacerbates VH. Based on this, GTHM leverages visual information to construct a coarse-to-fine visual view tree (CFTree) that organizes visual objects, attributes, and relationships in a hierarchical manner. Additionally, we model the optimal visual-token matching process on the CFTree as a cooperative game. Specifically, we define the Tree-based Shapley Value (TSV) for each visual view on the CFTree to assess its contribution to the overall visual understanding, thereby determining the optimal visual granularity. Subsequently, we utilize the TSV as guidance to implement adaptive weight contrastive decoding to achieve vision-aware decoding. Extensive experiments on four popular benchmarks confirm the effectiveness of our GTHM in alleviating VH across different LVLM families without additional training or post-processing. Our code is published at https://github.com/mengchuang123/GTHM.
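
The abstract above builds on the Shapley value from cooperative game theory. As background only, here is a minimal sketch of the standard, exact Shapley computation for a tiny game. It is not the paper's Tree-based Shapley Value: the player names (`global`, `object`, `attribute`), the weights, and the toy value function are hypothetical, chosen just to make the arithmetic easy to follow.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical toy game: each "view" has a standalone worth, and the
# object and attribute views earn a bonus when used together.
WEIGHTS = {"global": 0.2, "object": 0.4, "attribute": 0.1}


def toy_value(coalition):
    worth = sum(WEIGHTS[p] for p in coalition)
    if {"object", "attribute"} <= coalition:
        worth += 0.3  # synergy bonus, split evenly by the Shapley value
    return worth


if __name__ == "__main__":
    print(shapley_values(list(WEIGHTS), toy_value))
```

In this toy game the synergy bonus is shared equally between the two players that create it, and the values sum to the worth of the full coalition. Enumerating all orderings is only feasible for a handful of players.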

As a crucial task in task-oriented dialogue systems, spoken language understanding (SLU) has garnered increasing attention. However, errors from automatic speech recognition (ASR) often hinder understanding performance. To tackle this problem, we propose MoE-SLU, an ASR-robust SLU framework based on the mixture-of-experts technique. Specifically, we first introduce three strategies to generate additional transcripts from clean transcripts. Then, we employ the mixture-of-experts technique to weigh the representations of the generated transcripts, ASR transcripts, and the corresponding clean manual transcripts. Additionally, we regularize the weighted average of predictions and the predictions of ASR transcripts by minimizing the Jensen-Shannon Divergence (JSD) between these two output distributions. Experimental results on three benchmark SLU datasets demonstrate that our MoE-SLU achieves state-of-the-art performance. Further model analysis also verifies the superiority of our method.
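
For readers unfamiliar with the Jensen-Shannon Divergence named in this abstract, the following is a minimal sketch of the standard definition for two discrete distributions. It is generic background, not the paper's training code: the function names are illustrative, and how the divergence is weighted or combined with other losses in MoE-SLU is not specified here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Two example output distributions over three labels (made-up numbers).
    averaged_prediction = [0.7, 0.2, 0.1]
    asr_prediction = [0.5, 0.3, 0.2]
    print(js_divergence(averaged_prediction, asr_prediction))
```

Unlike the KL divergence, this quantity is symmetric and bounded above by ln 2, which makes it a convenient consistency penalty between two predictions.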

Learning to Match Representations is Better for End-to-End Task-Oriented Dialog System
Wanshi Xu | Xuxin Cheng | Zhihong Zhu | Zhanpeng Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2024

Due to the rapid development of pre-trained language models, fully end-to-end Task-Oriented Dialogue (TOD) systems exhibit superior performance. A key issue is how to efficiently retrieve entities from large-scale cross-domain databases. Most existing end-to-end TOD systems suffer from the following problems: (1) their ability to handle erroneous but easily confused entities needs to be improved; (2) matching information between contexts and entities is not captured, leading to weak modeling of domain-invariant and interpretable features and making it difficult to generalize to unseen domains. In this paper, we propose a method for knowledge retrieval driven by matching representations. The approach consists of a matching signal extractor, which extracts matching representations between contexts and entities that have generic conceptual features and hence domain-invariant properties, and an Attribute Filter, which filters irrelevant information to facilitate the re-selection of entities. Experiments on three standard benchmarks at the dialogue level and on large knowledge bases show that our retriever performs knowledge retrieval more efficiently than existing approaches.

