Yuexian Zou


2023

pdf bib
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Bang Yang | Fenglin Liu | Zheng Li | Qingyu Yin | Chenyu You | Bing Yin | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023

Generating an informative and attractive title for the product is a crucial task for e-commerce. Most existing works follow the standard multimodal natural language generation approaches, e.g., image captioning, and employ the large scale of human-labelled datasets to train desirable models. However, for novel products, especially in a different domain, there are few existing labelled data. In this paper, we propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with limited labels. We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style. To this end, we build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products. As a result, with extremely limited labels for training, the proposed method can retrieve the multimodal prompts to generate desirable titles for novel products. The experiments and analyses are conducted on five novel product categories under both the in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our proposed approach achieves the best few-shot results and even achieves competitive results with fully-supervised methods trained on 100% of training data; With the full labelled data for training, our method achieves state-of-the-art results.

pdf bib
ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding
Xuxin Cheng | Bowen Cao | Qichen Ye | Zhihong Zhu | Hongxiang Li | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023

Spoken language understanding (SLU) is a fundamental task in the task-oriented dialogue systems. However, the inevitable errors from automatic speech recognition (ASR) usually impair the understanding performance and lead to error propagation. Although there are some attempts to address this problem through contrastive learning, they (1) treat clean manual transcripts and ASR transcripts equally without discrimination in fine-tuning; (2) neglect the fact that the semantically similar pairs are still pushed away when applying contrastive learning; (3) suffer from the problem of Kullback–Leibler (KL) vanishing. In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework for improving ASR robustness in SLU. Specifically, in fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively, aiming to iteratively share knowledge between these two models. We also introduce a distance polarization regularizer to avoid pushing away the intra-cluster pairs as much as possible. Moreover, we use a cyclical annealing schedule to mitigate KL vanishing issue. Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.

pdf bib
Towards Unified Spoken Language Understanding Decoding via Label-aware Compact Linguistics Representations
Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Dongsheng Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023

Joint intent detection and slot filling models have shown promising success in recent years due to the high correlations between the two tasks. However, previous works independently decode the two tasks, which could result in misaligned predictions for both tasks. To address this shortcoming, we propose a novel method named Label-aware Compact Linguistics Representation (LCLR), which leverages label embeddings to jointly guide the decoding process. Concretely, LCLR projects both task-specific hidden states into a joint label latent space, where both task-specific hidden states could be concisely represented as linear combinations of label embeddings. Such feature decomposition of task-specific hidden states increases the representing power for the linguistics of utterance. Extensive experiments on two single- and multi-intent SLU benchmarks prove that LCLR can learn more discriminative label information than previous separate decoders, and consistently outperform previous state-of-the-art methods across all metrics. More encouragingly, LCLR can be applied to boost the performance of existing approaches, making it easy to be incorporated into any existing SLU models.

pdf bib
Accelerating Multiple Intent Detection and Slot Filling via Targeted Knowledge Distillation
Xuxin Cheng | Zhihong Zhu | Wanshi Xu | Yaowei Li | Hongxiang Li | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent non-autoregressive Spoken Language Understanding (SLU) models have attracted increasing attention because of their encouraging inference speed. However, most of existing methods (1) suffer from the multi-modality problem since they have little prior knowledge about the reference during inference; (2) fail to achieve a satisfactory inference speed limited by their complex frameworks. To tackle these issues, in this paper, we propose a Targeted Knowledge Distillation Framework (TKDF) for multi-intent SLU, which utilizes the knowledge distillation method to improve the performance. Specifically, we first train an SLU model as the teacher model, which has higher accuracy while slower inference speed. Then we introduce an evaluator and apply a curriculum learning strategy to select proper targets for the student model. Experiment results on two public multi-intent datasets show that our approach can realize a flexible trade-off between inference speed and accuracy, achieving comparable performance to the state-of-the-art models while speeding up by over 4.5 times. More encouragingly, further analysis shows that distilling only 4% of the original data can help the student model outperform its counterpart trained on the original data by about 14.6% in terms of overall accuracy on MixATIS dataset.

pdf bib
MRRL: Modifying the Reference via Reinforcement Learning for Non-Autoregressive Joint Multiple Intent Detection and Slot Filling
Xuxin Cheng | Zhihong Zhu | Bowen Cao | Qichen Ye | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2023

With the rise of non-autoregressive approach, some non-autoregressive models for joint multiple intent detection and slot filling have obtained the promising inference speed. However, most existing SLU models (1) suffer from the multi-modality problem that leads to reference intents and slots may not be suitable for training; (2) lack of alignment between the correct predictions of the two tasks, which extremely limits the overall accuracy. Therefore, in this paper, we propose Modifying the Reference via Reinforcement Learning (MRRL), a novel method for multiple intent detection and slot filling, which introduces a modifier and employs reinforcement learning. Specifically, we try to provide the better training target for the non-autoregressive SLU model via modifying the reference based on the output of the non-autoregressive SLU model, and propose a suitability reward to ensure that the output of the modifier module could fit well with the output of the non-autoregressive SLU model and does not deviate too far from the reference. In addition, we also propose a compromise reward to realize a flexible trade-off between the two subtasks. Experiments on two multi-intent datasets and non-autoregressive baselines demonstrate that our MRRL could consistently improve the performance of baselines. More encouragingly, our best variant achieves new state-of-the-art results, outperforming the previous best approach by 3.6 overall accuracy on MixATIS dataset.

pdf bib
Enhancing Code-Switching for Cross-lingual SLU: A Unified View of Semantic and Grammatical Coherence
Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Dongsheng Chen | Yuexian Zou
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Despite the success of spoken language understanding (SLU) in high-resource languages, achieving similar performance in low-resource settings, such as zero-shot scenarios, remains challenging due to limited labeled training data. To improve zero-shot cross-lingual SLU, recent studies have explored code-switched sentences containing tokens from multiple languages. However, vanilla code-switched sentences often lack semantic and grammatical coherence. We ascribe this lack to two issues: (1) randomly replacing code-switched tokens with equal probability and (2) disregarding token-level dependency within each language. To tackle these issues, in this paper, we propose a novel method termed SoGo, for zero-shot cross-lingual SLU. First, we use a saliency-based substitution approach to extract keywords as substitution options. Then, we introduce a novel token-level alignment strategy that considers the similarity between the context and the code-switched tokens, ensuring grammatical coherence in code-switched sentences. Extensive experiments and analyses demonstrate the superior performance of SoGo across nine languages on MultiATIS++.

pdf bib
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang Yang | Fenglin Liu | Xian Wu | Yaowei Wang | Xu Sun | Yuexian Zou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired language. In the testing stage, MultiCapCLIP instead takes visual data as input directly to retrieve the concept prompts to generate the final visual descriptions. The extensive experiments on image and video captioning across four benchmarks and four languages (i.e., English, Chinese, German, and French) confirm the effectiveness of our approach. Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at https://github.com/yangbang18/MultiCapCLIP.

2022

pdf bib
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Chenyu You | Nuo Chen | Fenglin Liu | Shen Ge | Xian Wu | Yuexian Zou
Findings of the Association for Computational Linguistics: NAACL 2022

In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that human seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogues flow given the speech documents. In this task, our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, by encouraging better alignments between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. We first show that the performance of the existing state-of-the-art methods significantly degrade on our dataset, hence demonstrating the necessity of incorporating cross-modal information to achieve good performance gains. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering. Codes and datasets will be made publicly available.

pdf bib
A Transformer-based Threshold-Free Framework for Multi-Intent NLU
Lisung Chen | Nuo Chen | Yuexian Zou | Yong Wang | Xinzhong Sun
Proceedings of the 29th International Conference on Computational Linguistics

Multi-intent natural language understanding (NLU) has recently gained attention. It detects multiple intents in an utterance, which is better suited to real-world scenarios. However, the state-of-the-art joint NLU models mainly detect multiple intents on threshold-based strategy, resulting in one main issue: the model is extremely sensitive to the threshold settings. In this paper, we propose a transformer-based Threshold-Free Multi-intent NLU model (TFMN) with multi-task learning (MTL). Specifically, we first leverage multiple layers of a transformer-based encoder to generate multi-grain representations. Then we exploit the information of the number of multiple intents in each utterance without additional manual annotations and propose an auxiliary detection task: Intent Number detection (IND). Furthermore, we propose a threshold-free intent multi-intent classifier that utilizes the output of IND task and detects the multiple intents without depending on the threshold. Extensive experiments demonstrate that our proposed model achieves superior results on two public multi-intent datasets.

2021

pdf bib
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Chenyu You | Nuo Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2021

Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for the optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations in a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks. Our code will be publicly available after publication.

pdf bib
On Pursuit of Designing Multi-modal Transformer for Video Grounding
Meng Cao | Long Chen | Mike Zheng Shou | Can Zhang | Yuexian Zou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression. 2) Bottom-up model: It directly predicts frame-wise probabilities of the referential segment boundaries. However, all these methods are not end-to-end, i.e., they always rely on some time-consuming post-processing steps to refine predictions. To this end, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed as GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction. To facilitate the end-to-end training, we use a Cubic Embedding layer to transform the raw videos into a set of visual tokens. To better fuse these two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks have validated the superiority of GTR. All three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.

2020

pdf bib
Federated Learning for Spoken Language Understanding
Zhiqi Huang | Fenglin Liu | Yuexian Zou
Proceedings of the 28th International Conference on Computational Linguistics

Recently, spoken language understanding (SLU) has attracted extensive research interests, and various SLU datasets have been proposed to promote the development. However, most of the existing methods focus on a single individual dataset, the efforts to improve the robustness of models and obtain better performance by combining the merits of various datasets are not well studied. In this paper, we argue that if these SLU datasets are considered together, different knowledge from different datasets could be learned jointly, and there are high chances to promote the performance of each dataset. At the same time, we further attempt to prevent data leakage when unifying multiple datasets which, arguably, is more useful in an industry setting. To this end, we propose a federated learning framework, which could unify various types of datasets as well as tasks to learn and fuse various types of knowledge, i.e., text representations, from different datasets and tasks, without the sharing of downstream task data. The fused text representations merge useful features from different SLU datasets and tasks and are thus much more powerful than the original text representations alone in individual tasks. At last, in order to provide multi-granularity text representations for our framework, we propose a novel Multi-view Encoder (MV-Encoder) as the backbone of our federated learning framework. Experiments on two SLU benchmark datasets, including two tasks (intention detection and slot filling) and federated learning settings (horizontal federated learning, vertical federated learning and federated transfer learning), demonstrate the effectiveness and universality of our approach. Specifically, we are able to get 1.53% improvement on the intent detection metric accuracy. And we could also boost the performance of a strong baseline by up to 5.29% on the slot filling metric F1. Furthermore, by leveraging BERT as an additional encoder, we establish new state-of-the-art results on SNIPS and ATIS datasets, where we get 99.33% and 98.28% in terms of accuracy on intent detection task as well as 97.20% and 96.41% in terms of F1 score on slot filling task, respectively.

pdf bib
Rethinking Skip Connection with Layer Normalization
Fenglin Liu | Xuancheng Ren | Zhiyuan Zhang | Xu Sun | Yuexian Zou
Proceedings of the 28th International Conference on Computational Linguistics

Skip connection is a widely-used technique to improve the performance and the convergence of deep neural networks, which is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value one. In this work, we investigate how the scale factors in the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the deepness of the models, which could by addressed by normalization, in particular, layer normalization, which induces consistent improvements over the plain skip connection. Inspired by the findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.