Yuexian Zou


pdf bib
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Chenyu You | Nuo Chen | Fenglin Liu | Shen Ge | Xian Wu | Yuexian Zou
Findings of the Association for Computational Linguistics: NAACL 2022

In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that human seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogues flow given the speech documents. In this task, our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, by encouraging better alignments between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. We first show that the performance of the existing state-of-the-art methods significantly degrade on our dataset, hence demonstrating the necessity of incorporating cross-modal information to achieve good performance gains. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering. Codes and datasets will be made publicly available.

pdf bib
A Transformer-based Threshold-Free Framework for Multi-Intent NLU
Lisung Chen | Nuo Chen | Yuexian Zou | Yong Wang | Xinzhong Sun
Proceedings of the 29th International Conference on Computational Linguistics

Multi-intent natural language understanding (NLU) has recently gained attention. It detects multiple intents in an utterance, which is better suited to real-world scenarios. However, the state-of-the-art joint NLU models mainly detect multiple intents on threshold-based strategy, resulting in one main issue: the model is extremely sensitive to the threshold settings. In this paper, we propose a transformer-based Threshold-Free Multi-intent NLU model (TFMN) with multi-task learning (MTL). Specifically, we first leverage multiple layers of a transformer-based encoder to generate multi-grain representations. Then we exploit the information of the number of multiple intents in each utterance without additional manual annotations and propose an auxiliary detection task: Intent Number detection (IND). Furthermore, we propose a threshold-free intent multi-intent classifier that utilizes the output of IND task and detects the multiple intents without depending on the threshold. Extensive experiments demonstrate that our proposed model achieves superior results on two public multi-intent datasets.


pdf bib
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Chenyu You | Nuo Chen | Yuexian Zou
Findings of the Association for Computational Linguistics: EMNLP 2021

Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for the optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations in a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks. Our code will be publicly available after publication.

pdf bib
On Pursuit of Designing Multi-modal Transformer for Video Grounding
Meng Cao | Long Chen | Mike Zheng Shou | Can Zhang | Yuexian Zou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression. 2) Bottom-up model: It directly predicts frame-wise probabilities of the referential segment boundaries. However, all these methods are not end-to-end, i.e., they always rely on some time-consuming post-processing steps to refine predictions. To this end, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed as GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction. To facilitate the end-to-end training, we use a Cubic Embedding layer to transform the raw videos into a set of visual tokens. To better fuse these two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks have validated the superiority of GTR. All three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.


pdf bib
Federated Learning for Spoken Language Understanding
Zhiqi Huang | Fenglin Liu | Yuexian Zou
Proceedings of the 28th International Conference on Computational Linguistics

Recently, spoken language understanding (SLU) has attracted extensive research interests, and various SLU datasets have been proposed to promote the development. However, most of the existing methods focus on a single individual dataset, the efforts to improve the robustness of models and obtain better performance by combining the merits of various datasets are not well studied. In this paper, we argue that if these SLU datasets are considered together, different knowledge from different datasets could be learned jointly, and there are high chances to promote the performance of each dataset. At the same time, we further attempt to prevent data leakage when unifying multiple datasets which, arguably, is more useful in an industry setting. To this end, we propose a federated learning framework, which could unify various types of datasets as well as tasks to learn and fuse various types of knowledge, i.e., text representations, from different datasets and tasks, without the sharing of downstream task data. The fused text representations merge useful features from different SLU datasets and tasks and are thus much more powerful than the original text representations alone in individual tasks. At last, in order to provide multi-granularity text representations for our framework, we propose a novel Multi-view Encoder (MV-Encoder) as the backbone of our federated learning framework. Experiments on two SLU benchmark datasets, including two tasks (intention detection and slot filling) and federated learning settings (horizontal federated learning, vertical federated learning and federated transfer learning), demonstrate the effectiveness and universality of our approach. Specifically, we are able to get 1.53% improvement on the intent detection metric accuracy. And we could also boost the performance of a strong baseline by up to 5.29% on the slot filling metric F1. Furthermore, by leveraging BERT as an additional encoder, we establish new state-of-the-art results on SNIPS and ATIS datasets, where we get 99.33% and 98.28% in terms of accuracy on intent detection task as well as 97.20% and 96.41% in terms of F1 score on slot filling task, respectively.

pdf bib
Rethinking Skip Connection with Layer Normalization
Fenglin Liu | Xuancheng Ren | Zhiyuan Zhang | Xu Sun | Yuexian Zou
Proceedings of the 28th International Conference on Computational Linguistics

Skip connection is a widely-used technique to improve the performance and the convergence of deep neural networks, which is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value one. In this work, we investigate how the scale factors in the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the deepness of the models, which could by addressed by normalization, in particular, layer normalization, which induces consistent improvements over the plain skip connection. Inspired by the findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.