Ruizhe Li


pdf bib
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Yuchen Hu | Chen Chen | Ruizhe Li | Heqing Zou | Eng Siong Chng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their representations, making it challenging to fuse them. In this paper, we aim to learn the shared representations across modalities to bridge their gap. Different from existing similar methods on other multimodal tasks like sentiment analysis, we focus on the temporal contextual dependencies considering the sequence-to-sequence task setting of AVSR. In particular, we propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN), which captures the commonality across modalities to ease the subsequent multimodal fusion process. Extensive experiments on public benchmarks LRS3 and LRS2 show that our approach outperforms the state-of-the-arts.

pdf bib
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Yuchen Hu | Ruizhe Li | Chen Chen | Chengwei Qin | Qiu-Shi Zhu | Eng Siong Chng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually faced with two practical challenges: 1) lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) less optimal model generality to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen robustness of AVSR, which can adapt to any testing noises while without dependence on noisy training data, a.k.a., unsupervised noise adaptation. Inspired by human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noisy conditions. Extensive experiments on public benchmarks LRS3 and LRS2 show that our approach achieves the state-of-the-art under various noisy as well as clean conditions. In addition, we also outperform previous state-of-the-arts on visual speech recognition task.


pdf bib
Affective Decoding for Empathetic Response Generation
Chengkun Zeng | Guanyi Chen | Chenghua Lin | Ruizhe Li | Zhi Chen
Proceedings of the 14th International Conference on Natural Language Generation

Understanding speaker’s feelings and producing appropriate responses with emotion connection is a key communicative skill for empathetic dialogue systems. In this paper, we propose a simple technique called Affective Decoding for empathetic response generation. Our method can effectively incorporate emotion signals during each decoding step, and can additionally be augmented with an auxiliary dual emotion encoder, which learns separate embeddings for the speaker and listener given the emotion base of the dialogue. Extensive empirical studies show that our models are perceived to be more empathetic by human evaluations, in comparison to several strong mainstream methods for empathetic responding.


pdf bib
Improving Variational Autoencoder for Text Modelling with Timestep-Wise Regularisation
Ruizhe Li | Xiao Li | Guanyi Chen | Chenghua Lin
Proceedings of the 28th International Conference on Computational Linguistics

The Variational Autoencoder (VAE) is a popular and powerful model applied to text modelling to generate diverse sentences. However, an issue known as posterior collapse (or KL loss vanishing) happens when the VAE is used in text modelling, where the approximate posterior collapses to the prior, and the model will totally ignore the latent variables and be degraded to a plain language model during text generation. Such an issue is particularly prevalent when RNN-based VAE models are employed for text modelling. In this paper, we propose a simple, generic architecture called Timestep-Wise Regularisation VAE (TWR-VAE), which can effectively avoid posterior collapse and can be applied to any RNN-based VAE models. The effectiveness and versatility of our model are demonstrated in different tasks, including language modelling and dialogue response generation.

pdf bib
DGST: a Dual-Generator Network for Text Style Transfer
Xiao Li | Guanyi Chen | Chenghua Lin | Ruizhe Li
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose DGST, a novel and simple Dual-Generator network architecture for text Style Transfer. Our model employs two generators only, and does not rely on any discriminators or parallel corpus for training. Both quantitative and qualitative experiments on the Yelp and IMDb datasets show that our model gives competitive performance compared to several strong baselines with more complicated architecture designs.


pdf bib
A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification
Ruizhe Li | Chenghua Lin | Matthew Collinson | Xiao Li | Guanyi Chen
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Recognising dialogue acts (DA) is important for many natural language processing tasks such as dialogue generation and intention recognition. In this paper, we propose a dual-attention hierarchical recurrent neural network for DA classification. Our model is partially inspired by the observation that conversational utterances are normally associated with both a DA and a topic, where the former captures the social act and the latter describes the subject matter. However, such a dependency between DAs and topics has not been utilised by most existing systems for DA classification. With a novel dual task-specific attention mechanism, our model is able, for utterances, to capture information about both DAs and topics, as well as information about the interactions between them. Experimental results show that by modelling topic as an auxiliary task, our model can significantly improve DA classification, yielding better or comparable performance to the state-of-the-art method on three public datasets.

pdf bib
A Stable Variational Autoencoder for Text Modelling
Ruizhe Li | Xiao Li | Chenghua Lin | Matthew Collinson | Rui Mao
Proceedings of the 12th International Conference on Natural Language Generation

Variational Autoencoder (VAE) is a powerful method for learning representations of high-dimensional data. However, VAEs can suffer from an issue known as latent variable collapse (or KL term vanishing), where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks. Such an issue is particularly prevalent when employing VAE-RNN architectures for text modelling (Bowman et al., 2016; Yang et al., 2017). In this paper, we present a new architecture called Full-Sampling-VAE-RNN, which can effectively avoid latent variable collapse. Compared to the general VAE-RNN architectures, we show that our model can achieve much more stable training process and can generate text with significantly better quality.


pdf bib
ABDN at SemEval-2018 Task 10: Recognising Discriminative Attributes using Context Embeddings and WordNet
Rui Mao | Guanyi Chen | Ruizhe Li | Chenghua Lin
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes the system that we submitted for SemEval-2018 task 10: capturing discriminative attributes. Our system is built upon a simple idea of measuring the attribute word’s similarity with each of the two semantically similar words, based on an extended word embedding method and WordNet. Instead of computing the similarities between the attribute and semantically similar words by using standard word embeddings, we propose a novel method that combines word and context embeddings which can better measure similarities. Our model is simple and effective, which achieves an average F1 score of 0.62 on the test set.