Zhihong Zhu

2024

Code-Switching Can be Better Aligners: Advancing Cross-Lingual SLU through Representation-Level and Prediction-Level Alignment
Zhihong Zhu | Xuxin Cheng | Zhanpeng Chen | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Zero-shot cross-lingual spoken language understanding (SLU) can promote the globalization application of dialog systems, which has attracted increasing attention. While current code-switching based cross-lingual SLU frameworks have shown promising results, they (i) predominantly utilize contrastive objectives to model hard alignment, which may disrupt the inherent structure within sentences of each language; and (ii) focus optimization objectives solely on the original sentences, neglecting the relation between original sentences and code-switched sentences, which may hinder contextualized embeddings from further alignment. In this paper, we propose a novel framework dubbed REPE (short for Representation-Level and Prediction-Level Alignment), which leverages both code-switched and original sentences to achieve multi-level alignment. Specifically, REPE introduces optimal transport to facilitate soft alignment between the representations of code-switched and original sentences, thereby preserving structural integrity as much as possible. Moreover, REPE adopts multi-view learning to enforce consistency regularization between the prediction of the two sentences, aligning them into a more refined language-invariant space. Based on this, we further incorporate a self-distillation layer to boost the robustness of REPE. Extensive experiments on two benchmarks across ten languages demonstrate the superiority of the proposed REPE framework.
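The soft alignment via optimal transport described in this abstract can be illustrated with a generic entropy-regularized (Sinkhorn) transport plan. This is a minimal sketch and not the REPE implementation: the toy embeddings, the cosine cost, the uniform marginals, and the names `cosine_cost` and `sinkhorn_plan` are all assumptions made for illustration.

```python
import math

def cosine_cost(xs, ys):
    """Pairwise cost matrix: 1 - cosine similarity between token vectors."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))
    return [[1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between two uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each original token (assumed uniform)
    b = [1.0 / m] * m  # mass on each code-switched token (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical 2-d token embeddings for an original and a code-switched sentence.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
code_switched = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]

cost = cosine_cost(original, code_switched)
plan = sinkhorn_plan(cost)
# A soft alignment loss is the transport cost under the plan.
ot_loss = sum(plan[i][j] * cost[i][j] for i in range(3) for j in range(3))
```

Unlike a hard contrastive objective, each token spreads its mass over several counterparts, so no token is forced onto a single match.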

Relevance Is a Guiding Light: Relevance-aware Adaptive Learning for End-to-end Task-oriented Dialogue System
Zhanpeng Chen | Zhihong Zhu | Wanshi Xu | Xianwei Zhuang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Retrieving accurate domain knowledge and providing helpful information are crucial in developing an effective end-to-end task-oriented dialogue system (E2ETOD). The field has witnessed numerous methods following the retrieve-then-generate paradigm and training their systems on one specific domain. However, existing approaches still suffer from the Distractive Attributes Problem (DAP): struggling to deal with false but similar knowledge (hard negative entities), which is even more intractable when countless pieces of knowledge from different domains are blended in a real-world scenario. To alleviate DAP, we propose the Relevance-aware Adaptive Learning (ReAL) method, a two-stage training framework that eliminates hard negatives step-by-step and aligns retrieval with generation. In the first stage, we introduce a top-k adaptive contrastive loss and utilize the divergence-driven feedback from the frozen generator to pre-train the retriever. In the second stage, we propose using the metric score distribution as an anchor to align retrieval with generation. Thorough experiments on three benchmark datasets demonstrate ReAL’s superiority over existing methods, with extensive analysis validating its strong capabilities of overcoming in- and cross-domain distractions.

What are the Generator Preferences for End-to-end Task-Oriented Dialog System?
Wanshi Xu | Xianwei Zhuang | Zhanpeng Chen | Zhihong Zhu | Xuxin Cheng | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Fully end-to-end task-oriented dialogue (EToD) systems have shown excellent performance, which requires the ability to retrieve entities accurately for generation. Existing methods improve the accuracy of entity retrieval and construct data flows between retrieval results and response generator, achieving promising results. However, most of them suffer from the following issues: (1) The entity is retrieved by directly interacting with the context at a coarse-grained level, so the similarity score may be disturbed by irrelevant attributes; (2) The generator pays equal attention to retrieved entities and the context and does not learn the generation preferences for the current turn. In this paper, we propose a framework called Regulating Preferences of Generator (RPG) based on retrieval results, which includes a generator preference extractor, an entity retriever, and a generator with the gate-controlled preference regulator. The generator preference extractor not only improves the entity retriever by filtering the interference of irrelevant attributes but also provides more focused guidance to the generator by performing inter-turn attribute prediction. Experiments and analyses on three standard benchmarks show that our framework outperforms existing methods and improves the quality of the dialogue.

Dual-oriented Disentangled Network with Counterfactual Intervention for Multimodal Intent Detection
Zhanpeng Chen | Zhihong Zhu | Xianwei Zhuang | Zhiqi Huang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multimodal intent detection is designed to leverage diverse modalities for a comprehensive understanding of user intentions in real-world scenarios, thus playing a critical role in modern task-oriented dialogue systems. Existing methods have made great progress in modal alignment and fusion; however, two vital limitations are neglected: (I) close entanglement of multimodal semantics with modal structures; (II) insufficient learning of the causal effects of semantic and modality-specific information on the final predictions under end-to-end training. To alleviate the above limitations, we introduce the Dual-oriented Disentangled Network with Counterfactual Intervention (DuoDN). DuoDN addresses key limitations in current systems by effectively disentangling and utilizing modality-specific and multimodal semantic information. The model consists of a Dual-oriented Disentangled Encoder that decouples semantics-oriented and modality-oriented representations, alongside a Counterfactual Intervention Module that applies causal inference to understand causal effects by injecting confounders. Experiments on three benchmark datasets demonstrate DuoDN’s superiority over existing methods, with extensive analysis validating its advantages.

Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory
Xianwei Zhuang | Zhihong Zhu | Zhanpeng Chen | Yuxin Xie | Liming Liang | Yuexian Zou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which hinders their application in multimodal understanding and decision-making. In this work, we introduce a novel plug-and-play train-free decoding algorithm named Game and Tree based Hallucination Mitigation (GTHM), designed for mitigating VH. GTHM is inspired by empirical observations that the fuzziness of multi-granularity view perception exacerbates VH. Based on this, GTHM leverages visual information to construct a coarse-to-fine visual view tree (CFTree) that organizes visual objects, attributes, and relationships in a hierarchical manner. Additionally, we innovatively model the optimal visual-token matching process on the CFTree as the cooperative game. Specifically, we define the Tree-based Shapley Value (TSV) for each visual view on the CFTree to assess its significant contribution to the overall visual understanding, thereby determining the optimal visual granularity. Subsequently, we utilize the TSV as guidance to implement adaptive weight contrastive decoding to achieve vision-aware decoding. Extensive experiments on four popular benchmarks confirm the effectiveness of our GTHM in alleviating VH across different LVLM families without additional training or post-processing. Our code is published at https://github.com/mengchuang123/GTHM.
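As background for the cooperative-game formulation in this abstract, the classical Shapley value can be computed exactly for a small game. This sketch shows only the standard Shapley value, not the Tree-based Shapley Value (TSV) defined in the paper. The three "views" and the payoff function `toy_value` are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical payoff: each view contributes a base score, and the
# "object" and "attribute" views earn a bonus when used together.
BASE = {"global": 1.0, "object": 2.0, "attribute": 1.0}

def toy_value(coalition):
    score = sum(BASE[p] for p in coalition)
    if "object" in coalition and "attribute" in coalition:
        score += 1.0
    return score

views = ["global", "object", "attribute"]
phi = shapley_values(views, toy_value)
```

The values sum to the payoff of the full coalition, which is the efficiency property that makes Shapley values usable as per-view contribution weights. Exact enumeration is factorial in the number of players, so it is only practical for small games like this one.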

Cyclical Contrastive Learning Based on Geodesic for Zero-shot Cross-lingual Spoken Language Understanding
Xuxin Cheng | Zhihong Zhu | Bang Yang | Xianwei Zhuang | Hongxiang Li | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2024

Owing to the scarcity of labeled training data, Spoken Language Understanding (SLU) is still a challenging task in low-resource languages. Therefore, zero-shot cross-lingual SLU attracts more and more attention. Contrastive learning is widely applied to explicitly align representations of similar sentences across different languages. However, the vanilla contrastive learning method may face two problems in zero-shot cross-lingual SLU: (1) the consistency between different languages is neglected; (2) each utterance has two kinds of SLU labels, i.e., slot and intent, yet utterances that differ in only one label are still pushed apart without discrimination, which limits performance. In this paper, we propose Cyclical Contrastive Learning based on Geodesic (CCLG), which introduces cyclical contrastive learning to achieve consistency between different languages and leverages geodesic distance to measure similarity when constructing positive and negative pairs. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on the MultiATIS++ and MTOP datasets, and the model analysis further verifies that CCLG can effectively transfer knowledge between different languages.

As a crucial task in the task-oriented dialogue systems, spoken language understanding (SLU) has garnered increasing attention. However, errors from automatic speech recognition (ASR) often hinder the performance of understanding. To tackle this problem, we propose MoE-SLU, an ASR-Robust SLU framework based on the mixture-of-experts technique. Specifically, we first introduce three strategies to generate additional transcripts from clean transcripts. Then, we employ the mixture-of-experts technique to weigh the representations of the generated transcripts, ASR transcripts, and the corresponding clean manual transcripts. Additionally, we also regularize the weighted average of predictions and the predictions of ASR transcripts by minimizing the Jensen-Shannon Divergence (JSD) between these two output distributions. Experiment results on three benchmark SLU datasets demonstrate that our MoE-SLU achieves state-of-the-art performance. Further model analysis also verifies the superiority of our method.
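The consistency term mentioned in this abstract, minimizing the Jensen-Shannon divergence (JSD) between the weighted average of predictions and the ASR-transcript prediction, can be sketched as follows. This is a generic illustration, not the MoE-SLU code: the three example distributions and the gate weights are made-up values, and `weighted_average` is a hypothetical helper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def weighted_average(dists, weights):
    """Mix several probability distributions using normalized weights."""
    total = sum(weights)
    return [sum(w * d[k] for w, d in zip(weights, dists)) / total
            for k in range(len(dists[0]))]

# Hypothetical intent distributions over three classes.
pred_clean = [0.80, 0.15, 0.05]      # clean manual transcript
pred_generated = [0.70, 0.20, 0.10]  # generated (augmented) transcript
pred_asr = [0.50, 0.30, 0.20]        # ASR transcript

# Assumed gate weights, standing in for a mixture-of-experts gate.
mixed = weighted_average([pred_clean, pred_generated, pred_asr], [0.5, 0.3, 0.2])
consistency_loss = js_divergence(mixed, pred_asr)
```

JSD is a natural choice for this kind of regularizer because it stays finite even when one distribution assigns zero probability to a class, unlike plain KL divergence.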

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs’ KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce **LOOK-M**, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a newly proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. **LOOK-M** demonstrates that with a significant reduction in KV cache memory usage, such as reducing it by **80%** in some cases, it not only achieves approximately **1.3x** faster decoding but also maintains or even **enhances** performance across a variety of long context multimodal tasks.

Multimodal emotion recognition in conversation (MERC) and multimodal emotion-cause pair extraction (MECPE) have recently garnered significant attention. Emotions are expressions of affect or feelings, and the specific events or situations they respond to are known as emotion causes. Together, the two explain the causality between human emotion and intents. However, existing works treat emotion recognition and emotion cause extraction as two individual problems, ignoring their natural causality. In this paper, we propose a Unified Multimodal Emotion recognition and Emotion-Cause analysis framework (UniMEEC) to explore the causality between emotion and emotion cause. Concretely, UniMEEC reformulates the MERC and MECPE tasks as mask prediction problems and unifies them with a causal prompt template. To differentiate the modal effects, UniMEEC proposes a multimodal causal prompt to probe the pre-trained knowledge specific to each modality and implements cross-task and cross-modality interactions under task-oriented settings. Experimental results on four public benchmark datasets verify the model's performance on the MERC and MECPE tasks, with consistent improvements over previous state-of-the-art methods.

The impressive capabilities of large language models (LLMs) have attracted extensive interest in applying LLMs to the medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.

Learning to Match Representations is Better for End-to-End Task-Oriented Dialog System
Wanshi Xu | Xuxin Cheng | Zhihong Zhu | Zhanpeng Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2024

Due to the rapid development of pre-trained language models, fully end-to-end Task-Oriented Dialogue (TOD) systems exhibit superior performance. A key issue is how to retrieve entities efficiently from large-scale cross-domain databases. Most existing end-to-end TOD systems suffer from the following problems: (1) the ability to handle erroneous but easily confused entities needs to be improved; (2) matching information between contexts and entities is not captured, leading to weak modeling of domain-invariant and interpretable features and making it difficult to generalize to unseen domains. In this paper, we propose a method for knowledge retrieval driven by matching representations. The approach consists of a matching signal extractor, which extracts matching representations between contexts and entities that have generic conceptual features and hence domain-invariant properties, and an Attribute Filter, which filters irrelevant information to facilitate the re-selection of entities. Experiments on three standard benchmarks at the dialogue level and on large knowledge bases show that our retriever performs knowledge retrieval more efficiently than existing approaches.

Alignment before Awareness: Towards Visual Question Localized-Answering in Robotic Surgery via Optimal Transport and Answer Semantics
Zhihong Zhu | Yunyan Zhang | Xuxin Cheng | Zhiqi Huang | Derong Xu | Xian Wu | Yefeng Zheng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The visual question localized-answering (VQLA) system has garnered increasing attention due to its potential as a knowledgeable assistant in surgical education. Apart from providing text-based answers, VQLA can also pinpoint the specific region of interest for better surgical scene understanding. Although recent Transformer-based models for VQLA have obtained promising results, they (1) conduct vanilla text-to-image cross attention, leading to unidirectional and coarse-grained alignment; (2) ignore exploiting the semantics of answers to further boost performance. In this paper, we propose a novel model termed OTAS, which first introduces optimal transport to achieve bidirectional and fine-grained alignment between images and questions, enabling more precise localization. Besides, OTAS incorporates a set of learnable candidate answer embeddings to query the probability of each answer class for a given image-question pair. Through Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the answer decision. Extensive experiments on two widely-used benchmark datasets demonstrate the superiority of our model over state-of-the-art methods.

InfoEnh: Towards Multimodal Sentiment Analysis via Information Bottleneck Filter and Optimal Transport Alignment
Yifeng Xie | Zhihong Zhu | Xuan Lu | Zhiqi Huang | Haoran Xiong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In recent years, Multimodal Sentiment Analysis (MSA) leveraging deep learning has demonstrated exceptional performance in a wide range of domains. Its success lies in effectively utilizing information from multiple modalities to analyze sentiments. Despite these advancements, MSA is confronted with two significant challenges. Firstly, each modality often has a surplus of unimportant data, which can overshadow the essential information. Secondly, the crucial cues for sentiment analysis may conflict across different modalities, thereby complicating the analysis process. These issues limit the model’s effectiveness in MSA tasks. To address these challenges, this paper introduces a novel method tailored for MSA, termed InfoEnh. This approach utilizes a masking technique as the bottleneck for information filtering, simultaneously maximizing mutual information to retain crucial data. Furthermore, the method integrates all modalities into a common feature space via domain adaptation, which is enhanced by the application of optimal transport. Extensive experiments conducted on two benchmark MSA datasets demonstrate the effectiveness of our proposed approach. Further analyses indicate significant improvements over the baselines.

Knowledge-enhanced Prompt Tuning for Dialogue-based Relation Extraction with Trigger and Label Semantic
Hao An | Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Yuexian Zou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Dialogue-based relation extraction (DRE) aims to determine the semantic relation of a given pair of arguments from a piece of dialogue, which has received increasing attention. Due to the low information density of dialogue text, it is difficult for the model to focus on key information. To this end, in this paper, we propose a Knowledge-Enhanced Prompt-Tuning (KEPT) method to effectively enhance the DRE model by exploiting trigger and label semantics. Specifically, we propose two beneficial tasks, masked trigger prediction and verbalizer representation learning, to effectively inject trigger knowledge and label semantic knowledge, respectively. Furthermore, we convert the DRE task to a masked language modeling task to unify the format of knowledge injection and utilization, aiming to better promote DRE performance. Experimental results on the DialogRE dataset show that our KEPT achieves state-of-the-art performance in F1 and F1c scores. Detailed analyses demonstrate the effectiveness and efficiency of our proposed approach. Code is available at https://github.com/blackbookay/KEPT.

Knowledge graph completion (KGC) is a widely used method to tackle incompleteness in knowledge graphs (KGs) by making predictions for missing links. Description-based KGC leverages pre-trained language models to learn entity and relation representations with their names or descriptions, which shows promising results. However, the performance of description-based KGC is still limited by the quality of text and the incomplete structure, as it lacks sufficient entity descriptions and relies solely on relation names, leading to sub-optimal results. To address this issue, we propose MPIKGC, a general framework to compensate for the deficiency of contextualized knowledge and improve KGC by querying large language models (LLMs) from various perspectives, which involves leveraging the reasoning, explanation, and summarization capabilities of LLMs to expand entity descriptions, understand relations, and extract structures, respectively. We conducted extensive evaluation of the effectiveness and improvement of our framework based on four description-based KGC models, for both link prediction and triplet classification tasks. All codes and generated data will be publicly available after review.

Towards Multi-modal Sarcasm Detection via Disentangled Multi-grained Multi-modal Distilling
Zhihong Zhu | Xuxin Cheng | Guimin Hu | Yaowei Li | Zhiqi Huang | Yuexian Zou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multi-modal sarcasm detection aims to identify whether a given sample with multi-modal information (i.e., text and image) is sarcastic, which has received increasing attention due to the rapid growth of multi-modal posts on modern social media. However, mainstream models process the input of each modality in a holistic manner, resulting in redundant and unrefined information. Moreover, the representations of different modalities are entangled in one common latent space to perform complex cross-modal interactions, neglecting the heterogeneity and distribution gap of different modalities. To address these issues, we propose a novel framework DMMD (short for Disentangled Multi-grained Multi-modal Distilling) for multi-modal sarcasm detection, which conducts multi-grained knowledge distilling (i.e., intra-subspace and inter-subspace) based on the disentangled multi-modal representations. Concretely, the representations of each modality are disentangled explicitly into modality-agnostic/specific subspaces. Then we transfer cross-modal knowledge by conducting intra-subspace knowledge distilling in a self-adaptive pattern. We also apply mutual learning to regularize the underlying inter-subspace consistency. Extensive experiments on a commonly used benchmark demonstrate the efficacy of our DMMD over cutting-edge methods. More encouragingly, visualization results indicate the multi-modal representations display meaningful distributional patterns, and we hope it will be helpful for the community of multi-modal knowledge transfer.

Zero-Shot Spoken Language Understanding via Large Language Models: A Preliminary Study
Zhihong Zhu | Xuxin Cheng | Hao An | Zhichang Wang | Dongsheng Chen | Zhiqi Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Zero-shot Spoken Language Understanding (SLU) aims to enable task-oriented dialogue systems to understand user needs without training data. Challenging but worthwhile, zero-shot SLU reduces the time and effort that data labeling takes. Recent advancements in large language models (LLMs), such as GPT3.5 and ChatGPT, have shown promising results in zero-shot settings, which motivates us to explore prompt-based methods. In this study, we investigate whether strong SLU models can be constructed by directly prompting LLMs. Specifically, we propose a simple yet effective two-stage framework dubbed GPT-SLU, which transforms the SLU task into a question-answering problem. Powered by multi-stage mutual guided prompts, GPT-SLU can leverage the correlations between two subtasks in SLU to achieve better predictions, which is greatly explored in the traditional fine-tuning paradigm. Experimental results on three SLU benchmark datasets demonstrate the significant potential of LLMs for zero-shot SLU. Comprehensive analyses validate the effectiveness of our proposed framework and also indicate that there is still room for further improvement of LLMs in SLU scenarios.

AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition
Zhaorun Chen | Zhuokai Zhao | Zhihong Zhu | Ruiqi Zhang | Xiang Li | Bhiksha Raj | Huaxiu Yao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework **AutoPRM** that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, **AutoPRM** first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially applies reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that **AutoPRM** significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, **AutoPRM** can be easily integrated with other orthogonal reasoning pipelines.

2023

Enhancing Code-Switching for Cross-lingual SLU: A Unified View of Semantic and Grammatical Coherence
Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Dongsheng Chen | Yuexian Zou
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Despite the success of spoken language understanding (SLU) in high-resource languages, achieving similar performance in low-resource settings, such as zero-shot scenarios, remains challenging due to limited labeled training data. To improve zero-shot cross-lingual SLU, recent studies have explored code-switched sentences containing tokens from multiple languages. However, vanilla code-switched sentences often lack semantic and grammatical coherence. We ascribe this lack to two issues: (1) randomly replacing code-switched tokens with equal probability and (2) disregarding token-level dependency within each language. To tackle these issues, in this paper, we propose a novel method termed SoGo, for zero-shot cross-lingual SLU. First, we use a saliency-based substitution approach to extract keywords as substitution options. Then, we introduce a novel token-level alignment strategy that considers the similarity between the context and the code-switched tokens, ensuring grammatical coherence in code-switched sentences. Extensive experiments and analyses demonstrate the superior performance of SoGo across nine languages on MultiATIS++.

ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding
Xuxin Cheng | Bowen Cao | Qichen Ye | Zhihong Zhu | Hongxiang Li | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023

Spoken language understanding (SLU) is a fundamental task in task-oriented dialogue systems. However, the inevitable errors from automatic speech recognition (ASR) usually impair the understanding performance and lead to error propagation. Although there are some attempts to address this problem through contrastive learning, they (1) treat clean manual transcripts and ASR transcripts equally without discrimination in fine-tuning; (2) neglect the fact that the semantically similar pairs are still pushed away when applying contrastive learning; (3) suffer from the problem of Kullback–Leibler (KL) vanishing. In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework for improving ASR robustness in SLU. Specifically, in fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively, aiming to iteratively share knowledge between these two models. We also introduce a distance polarization regularizer to avoid pushing away the intra-cluster pairs as much as possible. Moreover, we use a cyclical annealing schedule to mitigate the KL vanishing issue. Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
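A cyclical annealing schedule of the kind mentioned in this abstract repeatedly ramps the weight of the KL term from 0 up to 1 instead of increasing it only once. The sketch below shows one common linear-ramp form. The function name `cyclical_kl_weight`, the number of cycles, and the ramp ratio are assumptions, since the abstract does not give the exact schedule.

```python
def cyclical_kl_weight(step, total_steps, n_cycles=4, ramp_ratio=0.5):
    """KL weight in [0, 1]: linear ramp for the first part of each cycle, then flat."""
    period = total_steps / n_cycles
    position = (step % period) / period  # progress within the current cycle
    return min(1.0, position / ramp_ratio)

# Hypothetical run of 100 steps with 4 cycles of 25 steps each.
schedule = [cyclical_kl_weight(s, total_steps=100) for s in range(100)]

# In a training loop the weight would scale the KL term, e.g.
#   loss = task_loss + cyclical_kl_weight(step, total_steps) * kl_loss
```

Resetting the weight to 0 at the start of each cycle lets the model periodically refit without KL pressure, which is the usual motivation for this schedule as a remedy for KL vanishing.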

Towards Unified Spoken Language Understanding Decoding via Label-aware Compact Linguistics Representations
Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Dongsheng Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023

Joint intent detection and slot filling models have shown promising success in recent years due to the high correlations between the two tasks. However, previous works independently decode the two tasks, which could result in misaligned predictions for both tasks. To address this shortcoming, we propose a novel method named Label-aware Compact Linguistics Representation (LCLR), which leverages label embeddings to jointly guide the decoding process. Concretely, LCLR projects both task-specific hidden states into a joint label latent space, where both task-specific hidden states could be concisely represented as linear combinations of label embeddings. Such feature decomposition of task-specific hidden states increases the representing power for the linguistics of utterance. Extensive experiments on two single- and multi-intent SLU benchmarks prove that LCLR can learn more discriminative label information than previous separate decoders, and consistently outperform previous state-of-the-art methods across all metrics. More encouragingly, LCLR can be applied to boost the performance of existing approaches, making it easy to be incorporated into any existing SLU models.

MCLF: A Multi-grained Contrastive Learning Framework for ASR-robust Spoken Language Understanding
Zhiqi Huang | Dongsheng Chen | Zhihong Zhu | Xuxin Cheng
Findings of the Association for Computational Linguistics: EMNLP 2023

Enhancing the robustness towards Automatic Speech Recognition (ASR) errors is of great importance for Spoken Language Understanding (SLU). Trending ASR-robust SLU systems have witnessed impressive improvements through global contrastive learning. However, although most ASR errors occur only at local positions of utterances, they can easily lead to severe semantic changes, and utterance-level classification or comparison is difficult to distinguish such differences. To address the problem, we propose a two-stage multi-grained contrastive learning framework dubbed MCLF. Technically, we first adapt the pre-trained language models to downstream SLU datasets via the proposed multi-grained contrastive learning objective and then fine-tune it on the corresponding dataset. Besides, to facilitate contrastive learning in the pre-training stage, we explore several data augmentation methods to expand the training data. Experimental results and detailed analyses on four datasets and four BERT-like backbone models demonstrate the effectiveness of our approach.

Recent non-autoregressive Spoken Language Understanding (SLU) models have attracted increasing attention because of their encouraging inference speed. However, most existing methods (1) suffer from the multi-modality problem since they have little prior knowledge about the reference during inference; (2) fail to achieve a satisfactory inference speed because of their complex frameworks. To tackle these issues, in this paper, we propose a Targeted Knowledge Distillation Framework (TKDF) for multi-intent SLU, which utilizes the knowledge distillation method to improve the performance. Specifically, we first train an SLU model as the teacher model, which has higher accuracy but slower inference speed. Then we introduce an evaluator and apply a curriculum learning strategy to select proper targets for the student model. Experimental results on two public multi-intent datasets show that our approach can realize a flexible trade-off between inference speed and accuracy, achieving comparable performance to the state-of-the-art models while speeding up by over 4.5 times. More encouragingly, further analysis shows that distilling only 4% of the original data can help the student model outperform its counterpart trained on the original data by about 14.6% in terms of overall accuracy on the MixATIS dataset.

pdf bib abs
MRRL: Modifying the Reference via Reinforcement Learning for Non-Autoregressive Joint Multiple Intent Detection and Slot Filling
Xuxin Cheng | Zhihong Zhu | Bowen Cao | Qichen Ye | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2023

With the rise of non-autoregressive approaches, some non-autoregressive models for joint multiple intent detection and slot filling have obtained promising inference speed. However, most existing SLU models (1) suffer from the multi-modality problem, so that the reference intents and slots may not be suitable for training; (2) lack alignment between the correct predictions of the two tasks, which severely limits the overall accuracy. Therefore, in this paper, we propose Modifying the Reference via Reinforcement Learning (MRRL), a novel method for multiple intent detection and slot filling, which introduces a modifier and employs reinforcement learning. Specifically, we try to provide a better training target for the non-autoregressive SLU model by modifying the reference based on the output of the non-autoregressive SLU model, and propose a suitability reward to ensure that the output of the modifier module fits well with the output of the non-autoregressive SLU model and does not deviate too far from the reference. In addition, we also propose a compromise reward to realize a flexible trade-off between the two subtasks. Experiments on two multi-intent datasets and non-autoregressive baselines demonstrate that MRRL consistently improves the performance of the baselines. More encouragingly, our best variant achieves new state-of-the-art results, outperforming the previous best approach by 3.6 in overall accuracy on the MixATIS dataset.

pdf bib abs
Syntax Matters: Towards Spoken Language Understanding via Syntax-Aware Attention
Yifeng Xie | Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Dongsheng Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Spoken Language Understanding (SLU), a crucial component of task-oriented dialogue systems, has consistently garnered attention from both academic and industrial communities. Although incorporating syntactic information into models has the potential to enhance the comprehension of user utterances and yield impressive results, its application in SLU systems remains largely unexplored. In this paper, we propose a carefully designed model termed Syntax-aware attention (SAT) to enhance SLU, where attention scopes are constrained based on relationships within the syntactic structure. Experimental results on three datasets show that our model achieves substantial improvements and excellent performance. Moreover, SAT can be integrated into other BERT-based language models to further boost their performance.
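
One way to picture "attention scopes constrained by the syntactic structure" in the abstract above is a masked softmax in which each token may attend only to tokens within a few hops of it in a dependency tree. The sketch below assumes exactly that; the hop-distance rule, the `max_hops` parameter, the function names, and the toy parse are hypothetical and are not taken from the SAT paper.

```python
import math
from collections import deque


def hop_distances(heads):
    """All-pairs hop distances in the undirected dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    n = len(heads)
    neighbors = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].append(head)
            neighbors[head].append(child)
    distances = []
    for start in range(n):
        dist = [None] * n
        dist[start] = 0
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nb in neighbors[node]:
                if dist[nb] is None:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        distances.append(dist)
    return distances


def syntax_aware_attention(scores, heads, max_hops=1):
    """Row-wise softmax over `scores`, restricted to syntactic neighbors."""
    dist = hop_distances(heads)
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row))
                   if dist[i][j] is not None and dist[i][j] <= max_hops]
        peak = max(row[j] for j in allowed)
        exps = {j: math.exp(row[j] - peak) for j in allowed}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights


# Toy parse (hypothetical) of "book a flight": "book" is the root,
# "a" depends on "flight", and "flight" depends on "book".
heads = [-1, 2, 0]
scores = [[0.0] * 3 for _ in range(3)]  # uniform raw attention scores
attention = syntax_aware_attention(scores, heads, max_hops=1)
```

In this toy parse "book" and "a" are two hops apart, so with `max_hops=1` neither attends to the other, while "flight" is adjacent to both and attends to all three tokens.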