2024
Relevance Is a Guiding Light: Relevance-aware Adaptive Learning for End-to-end Task-oriented Dialogue System
Zhanpeng Chen | Zhihong Zhu | Wanshi Xu | Xianwei Zhuang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Retrieving accurate domain knowledge and providing helpful information are crucial in developing an effective end-to-end task-oriented dialogue system (E2ETOD). The field has witnessed numerous methods following the retrieve-then-generate paradigm and training their systems on one specific domain. However, existing approaches still suffer from the Distractive Attributes Problem (DAP): struggling to deal with false but similar knowledge (hard negative entities), which is even more intractable when countless pieces of knowledge from different domains are blended in a real-world scenario. To alleviate DAP, we propose the Relevance-aware Adaptive Learning (ReAL) method, a two-stage training framework that eliminates hard negatives step-by-step and aligns retrieval with generation. In the first stage, we introduce a top-k adaptive contrastive loss and utilize the divergence-driven feedback from the frozen generator to pre-train the retriever. In the second stage, we propose using the metric score distribution as an anchor to align retrieval with generation. Thorough experiments on three benchmark datasets demonstrate ReAL’s superiority over existing methods, with extensive analysis validating its strong capabilities of overcoming in- and cross-domain distractions.
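As a rough illustration of the first-stage idea above (contrasting the gold entity against only its hardest retrieved competitors), the following PyTorch sketch implements a top-k hard-negative contrastive loss. The InfoNCE form, k, and temperature are assumptions for illustration, not the paper's exact adaptive objective.

```python
# Minimal sketch of a top-k contrastive loss over hard negative entities.
# Illustrative approximation only, NOT the exact ReAL objective.
import torch
import torch.nn.functional as F

def topk_contrastive_loss(query, positive, candidates, k=8, temperature=0.1):
    """query: (d,), positive: (d,), candidates: (n, d) non-positive entities."""
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    candidates = F.normalize(candidates, dim=-1)

    pos_score = query @ positive / temperature                  # scalar
    neg_scores = candidates @ query / temperature               # (n,)
    # Keep only the k highest-scoring (i.e., hardest) negatives.
    hard_negs, _ = torch.topk(neg_scores, k=min(k, neg_scores.numel()))

    logits = torch.cat([pos_score.unsqueeze(0), hard_negs])     # (1 + k,)
    # InfoNCE: the positive entity should receive the highest score.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```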
What are the Generator Preferences for End-to-end Task-Oriented Dialog System?
Wanshi Xu | Xianwei Zhuang | Zhanpeng Chen | Zhihong Zhu | Xuxin Cheng | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Fully end-to-end task-oriented dialogue (EToD) systems have shown excellent performance, which requires the ability to retrieve entities accurately for generation. Existing methods improve the accuracy of entity retrieval and construct data flows between retrieval results and the response generator, achieving promising results. However, most of them suffer from the following issues: (1) the entity is retrieved by directly interacting with the context at a coarse-grained level, so the similarity score may be disturbed by irrelevant attributes; (2) the generator pays equal attention to retrieved entities and the context and does not learn the generation preferences for the current turn. In this paper, we propose a framework called Regulating Preferences of Generator (RPG) based on retrieval results, which includes a generator preference extractor, an entity retriever, and a generator with a gate-controlled preference regulator. The generator preference extractor not only improves the entity retriever by filtering out the interference of irrelevant attributes but also provides more focused guidance to the generator by performing inter-turn attribute prediction. Experiments and analyses on three standard benchmarks show that our framework outperforms existing methods and improves dialogue quality.
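The gate-controlled preference regulator described above can be pictured as a learned interpolation between entity and context representations. The sketch below is an illustrative assumption of that mechanism (layer names and sizes are invented), not the actual RPG architecture.

```python
# Minimal sketch of a gate-controlled regulator that balances retrieved-entity
# and dialogue-context representations.
import torch
import torch.nn as nn

class GatedPreferenceRegulator(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Gate conditioned on the extracted preference and the context.
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, context_repr, entity_repr, preference_repr):
        # g in (0, 1) decides how much the generator should rely on entities.
        g = torch.sigmoid(self.gate(torch.cat([preference_repr, context_repr], dim=-1)))
        return g * entity_repr + (1.0 - g) * context_repr

# Usage: fused = GatedPreferenceRegulator(768)(ctx, ent, pref), each of shape (batch, 768).
```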
Dual-oriented Disentangled Network with Counterfactual Intervention for Multimodal Intent Detection
Zhanpeng Chen | Zhihong Zhu | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multimodal intent detection is designed to leverage diverse modalities for a comprehensive understanding of user intentions in real-world scenarios, thus playing a critical role in modern task-oriented dialogue systems. Existing methods have made great progress in modal alignment and fusion; however, two vital limitations are neglected: (I) close entanglement of multimodal semantics with modal structures; (II) insufficient learning of the causal effects of semantic and modality-specific information on the final predictions in the end-to-end training regime. To alleviate the above limitations, we introduce the Dual-oriented Disentangled Network with Counterfactual Intervention (DuoDN). DuoDN addresses key limitations in current systems by effectively disentangling and utilizing modality-specific and multimodal semantic information. The model consists of a Dual-oriented Disentangled Encoder that decouples semantics-oriented and modality-oriented representations, alongside a Counterfactual Intervention Module that applies causal inference to understand causal effects by injecting confounders. Experiments on three benchmark datasets demonstrate DuoDN’s superiority over existing methods, with extensive analysis validating its advantages.
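One common way to realize counterfactual intervention of the kind described above is to subtract the prediction obtained under a confounder-only (counterfactual) input from the factual prediction. The sketch below shows that generic recipe; the zero-masking baseline and model interface are assumptions, not DuoDN's actual module.

```python
# Minimal sketch of counterfactual debiasing: remove the contribution that the
# model would make from modality-specific signal alone.
import torch

def counterfactual_logits(model, semantic_feats, modality_feats):
    factual = model(semantic_feats, modality_feats)
    # Counterfactual branch: semantic content replaced by a neutral baseline,
    # so only the modality-specific "confounder" signal drives the prediction.
    counterfactual = model(torch.zeros_like(semantic_feats), modality_feats)
    # Total indirect effect: subtract the confounder's direct contribution.
    return factual - counterfactual
```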
Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory
Xianwei Zhuang | Zhihong Zhu | Zhanpeng Chen | Yuxin Xie | Liming Liang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which hinders their application in multimodal understanding and decision-making. In this work, we introduce a novel plug-and-play, training-free decoding algorithm named Game and Tree based Hallucination Mitigation (GTHM), designed for mitigating VH. GTHM is inspired by empirical observations that the fuzziness of multi-granularity view perception exacerbates VH. Based on this, GTHM leverages visual information to construct a coarse-to-fine visual view tree (CFTree) that organizes visual objects, attributes, and relationships in a hierarchical manner. Additionally, we innovatively model the optimal visual-token matching process on the CFTree as a cooperative game. Specifically, we define the Tree-based Shapley Value (TSV) for each visual view on the CFTree to assess its contribution to the overall visual understanding, thereby determining the optimal visual granularity. Subsequently, we utilize the TSV as guidance to implement adaptive-weight contrastive decoding and achieve vision-aware decoding. Extensive experiments on four popular benchmarks confirm the effectiveness of our GTHM in alleviating VH across different LVLM families without additional training or post-processing. Our code is published at https://github.com/mengchuang123/GTHM.
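The final step, adaptive-weight contrastive decoding guided by a per-view importance score, can be sketched as follows. This is a generic illustration of contrastive decoding with a scalar weight standing in for the TSV, not the released GTHM implementation (see the repository linked above).

```python
# Minimal sketch of contrastive decoding whose strength is scaled by an
# importance score (standing in for the Tree-based Shapley Value).
import torch

def adaptive_contrastive_decode(logits_with_view, logits_without_view, view_score, alpha=1.0):
    """Both logit tensors: (vocab,). view_score in [0, 1] weights the contrast."""
    weight = alpha * view_score
    # Amplify tokens the selected visual view supports, penalize those it does not.
    contrasted = (1 + weight) * logits_with_view - weight * logits_without_view
    return torch.softmax(contrasted, dim=-1)
```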
Cyclical Contrastive Learning Based on Geodesic for Zero-shot Cross-lingual Spoken Language Understanding
Xuxin Cheng | Zhihong Zhu | Bang Yang | Xianwei Zhuang | Hongxiang Li | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2024
Owing to the scarcity of labeled training data, Spoken Language Understanding (SLU) remains a challenging task in low-resource languages. Therefore, zero-shot cross-lingual SLU has attracted more and more attention. Contrastive learning is widely applied to explicitly align representations of similar sentences across different languages. However, the vanilla contrastive learning method may face two problems in zero-shot cross-lingual SLU: (1) the consistency between different languages is neglected; (2) each utterance has two different kinds of SLU labels, i.e., slot and intent, yet utterances that differ in only one label are pushed away without any discrimination, which limits performance. In this paper, we propose Cyclical Contrastive Learning based on Geodesic (CCLG), which introduces cyclical contrastive learning to achieve consistency between different languages and leverages geodesic distance to measure similarity when constructing positive and negative pairs. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on the MultiATIS++ and MTOP datasets, and the model analysis further verifies that CCLG can effectively transfer knowledge between different languages.
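The geodesic similarity used to build contrastive pairs can be understood as the arc length between unit-normalized sentence embeddings on the hypersphere. A minimal sketch, assuming embeddings are compared pointwise within a batch:

```python
# Geodesic (great-circle) distance between sentence embeddings on the unit
# hypersphere, as a drop-in alternative to cosine similarity. Illustrative only.
import torch
import torch.nn.functional as F

def geodesic_distance(x, y, eps=1e-7):
    """x, y: (batch, d). Returns the arc length between unit-normalized vectors."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    cos = (x * y).sum(dim=-1).clamp(-1 + eps, 1 - eps)  # avoid NaN at +/-1
    return torch.acos(cos)                               # in [0, pi]
```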
MoE-SLU: Towards ASR-Robust Spoken Language Understanding via Mixture-of-Experts
Xuxin Cheng | Zhihong Zhu | Xianwei Zhuang | Zhanpeng Chen | Zhiqi Huang | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2024
As a crucial task in task-oriented dialogue systems, spoken language understanding (SLU) has garnered increasing attention. However, errors from automatic speech recognition (ASR) often hinder understanding performance. To tackle this problem, we propose MoE-SLU, an ASR-robust SLU framework based on the mixture-of-experts technique. Specifically, we first introduce three strategies to generate additional transcripts from clean transcripts. Then, we employ the mixture-of-experts technique to weight the representations of the generated transcripts, ASR transcripts, and the corresponding clean manual transcripts. Additionally, we regularize the weighted average of predictions and the predictions of ASR transcripts by minimizing the Jensen-Shannon Divergence (JSD) between these two output distributions. Experimental results on three benchmark SLU datasets demonstrate that our MoE-SLU achieves state-of-the-art performance. Further model analysis also verifies the superiority of our method.
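The JSD regularizer mentioned above is straightforward to write down: it is the symmetric divergence between the expert-weighted output distribution and the distribution predicted from the ASR transcript. A minimal sketch (the expert weighting network itself is omitted):

```python
# Jensen-Shannon Divergence between two prediction distributions, e.g. the
# expert-weighted prediction and the ASR-transcript prediction.
import torch
import torch.nn.functional as F

def jsd(p_logits, q_logits):
    """p_logits, q_logits: (batch, num_classes)."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    # 0.5 * [KL(p || m) + KL(q || m)]; kl_div expects log-probs as its input.
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))
```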
MaCSC: Towards Multimodal-augmented Pre-trained Language Models via Conceptual Prototypes and Self-balancing Calibration
Xianwei Zhuang | Zhichang Wang | Xuxin Cheng | Yuxin Xie | Liming Liang | Yuexian Zou
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Pre-trained language models (PLMs) that rely solely on textual data may exhibit limitations in multimodal semantics comprehension. Existing solutions attempt to alleviate this issue by incorporating explicit image retrieval or generation techniques. However, these methods: (1) focus exclusively on the static image modality; (2) inevitably encounter modality gaps and noise; (3) indiscriminately treat all modalities. In this paper, we propose a novel multimodal-augmented framework termed MaCSC, which can infuse multimodal semantics into PLMs and facilitate a self-balancing calibration of information allocation. Specifically, MaCSC obtains modal-specific conceptual prototypes from contrastive pre-training models (e.g., CLIP), and aggregates the intra- and inter-modal semantics of the conceptual prototype to enhance PLMs. In addition, we utilize a novel self-balancing contrastive loss to achieve multi-scale self-balancing calibration of multimodal information when fine-tuning PLMs. Experimental results show that MaCSC consistently improves the performance of PLMs across various architectures and scales, and outperforms competitive baselines on multiple NLP tasks.
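As an illustration of drawing conceptual prototypes from a contrastive pre-trained model, the sketch below averages CLIP text embeddings over a few prompt templates and mixes the projected prototype into a PLM hidden state. The prompts, projection layer, and mixing weight are assumptions, not the MaCSC design.

```python
# Building a modality-grounded conceptual prototype from CLIP and adding it to
# a PLM representation. Illustrative only.
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def concept_prototype(concept: str) -> torch.Tensor:
    # Average CLIP text embeddings over a few prompt templates for the concept.
    prompts = [f"a photo of a {concept}", f"an image of a {concept}"]
    inputs = tok(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)          # (num_prompts, 512)
    return feats.mean(dim=0)                              # (512,)

def enhance(plm_hidden: torch.Tensor, prototype: torch.Tensor,
            proj: torch.nn.Linear, alpha: float = 0.1) -> torch.Tensor:
    # Project the prototype into the PLM space and add it with a small weight.
    return plm_hidden + alpha * proj(prototype)
```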
PCAD: Towards ASR-Robust Spoken Language Understanding via Prototype Calibration and Asymmetric Decoupling
Xianwei Zhuang | Xuxin Cheng | Liming Liang | Yuxin Xie | Zhichang Wang | Zhiqi Huang | Yuexian Zou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Spoken language understanding (SLU) inevitably suffers from error propagation from automatic speech recognition (ASR) in real-world scenarios. Some recent works attempt to alleviate this issue through contrastive learning. However, they (1) sample negative pairs incorrectly in pre-training; (2) only focus on implicit metric learning while neglecting explicit erroneous predictions; (3) treat manual and ASR transcripts indiscriminately. In this paper, we propose a novel framework termed PCAD, which can calibrate bias and errors and achieve adaptive-balanced decoupling training. Specifically, PCAD utilizes a prototype-based loss to aggregate label and prediction priors and calibrate bias and error-prone semantics for better inter-class discrimination and intra-class consistency. We theoretically analyze the effect of this loss on robustness enhancement. Further, we leverage a teacher-student model for asymmetric decoupling training between different transcripts and formulate a novel gradient-sensitive exponential moving average (GS-EMA) algorithm for an adaptive balance of accuracy and robustness. Experiments on three datasets show that PCAD significantly outperforms existing approaches and achieves new state-of-the-art performance.
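The GS-EMA idea, slowing the teacher's EMA update when the student's gradients are large, can be sketched as below. The tanh modulation and the use of a global gradient norm are illustrative assumptions; the paper's exact formulation may differ.

```python
# Gradient-sensitive EMA update of a teacher model from a student model.
# Call after loss.backward() so that student gradients are populated.
import torch

@torch.no_grad()
def gs_ema_update(teacher, student, base_decay=0.999, sensitivity=0.1):
    # Global gradient norm of the student.
    grads = [p.grad.norm() for p in student.parameters() if p.grad is not None]
    grad_norm = torch.stack(grads).norm() if grads else torch.tensor(0.0)
    # Larger gradients -> decay closer to 1 -> more conservative teacher update.
    decay = float(base_decay + (1 - base_decay) * torch.tanh(sensitivity * grad_norm))
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)
```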
Code-Switching Can be Better Aligners: Advancing Cross-Lingual SLU through Representation-Level and Prediction-Level Alignment
Zhihong Zhu | Xuxin Cheng | Zhanpeng Chen | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Zero-shot cross-lingual spoken language understanding (SLU) can promote the global deployment of dialogue systems, and has therefore attracted increasing attention. While current code-switching based cross-lingual SLU frameworks have shown promising results, they (i) predominantly utilize contrastive objectives to model hard alignment, which may disrupt the inherent structure within sentences of each language; and (ii) focus optimization objectives solely on the original sentences, neglecting the relation between original and code-switched sentences, which may hinder contextualized embeddings from further alignment. In this paper, we propose a novel framework dubbed REPE (short for Representation-Level and Prediction-Level Alignment), which leverages both code-switched and original sentences to achieve multi-level alignment. Specifically, REPE introduces optimal transport to facilitate soft alignment between the representations of code-switched and original sentences, thereby preserving structural integrity as much as possible. Moreover, REPE adopts multi-view learning to enforce consistency regularization between the predictions for the two sentences, aligning them into a more refined language-invariant space. Based on this, we further incorporate a self-distillation layer to boost the robustness of REPE. Extensive experiments on two benchmarks across ten languages demonstrate the superiority of the proposed REPE framework.
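The optimal-transport soft alignment between code-switched and original sentence representations can be approximated with a few Sinkhorn iterations, as in the sketch below; the Euclidean cost and iteration count are illustrative choices, not necessarily those used in REPE.

```python
# Entropy-regularized optimal transport (Sinkhorn-Knopp) between two sets of
# token representations, returning a soft alignment (transport plan).
import torch

def sinkhorn_alignment(x, y, epsilon=0.05, n_iters=50):
    """x: (n, d) code-switched tokens, y: (m, d) original tokens.
    Returns a transport plan (n, m) whose marginals are uniform."""
    cost = torch.cdist(x, y, p=2)                 # pairwise Euclidean cost
    K = torch.exp(-cost / epsilon)                # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iters):                      # Sinkhorn iterations
        u = a / (K @ (b / (K.t() @ u)))
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)    # transport plan P
```

A soft alignment loss can then be taken as `(sinkhorn_alignment(x, y) * torch.cdist(x, y)).sum()`, which is small when matched tokens are close in representation space.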