Pan Zhou


2021

pdf bib
Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos
Daizong Liu | Xiaoye Qu | Jianfeng Dong | Pan Zhou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals. Although they have achieved decent performance, the proposals are handcrafted and redundant. Recently, bottom-up framework attracts increasing attention due to its superior efficiency. It directly predicts the probabilities for each frame as a boundary. However, the performance of bottom-up model is inferior to the top-down counterpart as it fails to exploit the segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while speeding up the efficiency. Specifically, we first perform a foreground-background classification upon the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and the redundant proposals are decreased. Then, a proposal consolidation module is further developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments show that our proposed APGN significantly outperforms previous state-of-the-art methods on three challenging benchmarks.

pdf bib
Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding
Daizong Liu | Xiaoye Qu | Pan Zhou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A key solution to temporal sentence grounding (TSG) exists in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between inter- and intra-modality are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between vision and language domains step-by-step for progressively reasoning the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state-of-the-arts.

pdf bib
Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition
Guolin Zheng | Yubei Xiao | Ke Gong | Pan Zhou | Xiaodan Liang | Liang Lin
Findings of the Association for Computational Linguistics: EMNLP 2021

Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.

2020

pdf bib
Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network
Daizong Liu | Xiaoye Qu | Jianfeng Dong | Pan Zhou
Proceedings of the 28th International Conference on Computational Linguistics

Temporal sentence localization in videos aims to ground the best matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on attentional frameworks to align the temporal boundaries by a soft selection. Although they focus on the visual content relevant to the query, these single-step attention are insufficient to model complex video contents and restrict the higher-level reasoning demand for this task. In this paper, we propose a novel deep rectification-modulation network (RMN), transforming this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods directly conducting the cross-modal interaction, we first devise a rectification module to correct implicit attention misalignment which focuses on the wrong position during the cross-interaction process. Then, a modulation module is developed to capture the frame-to-frame relation with the help of sentence information for better correlating and composing the video contents over time. With multiple such layers cascaded in depth, our RMN progressively refines video and query interactions, thus enabling a further precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method.

2019

pdf bib
Adversarial Category Alignment Network for Cross-domain Sentiment Classification
Xiaoye Qu | Zhikang Zou | Yu Cheng | Yang Yang | Pan Zhou
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Cross-domain sentiment classification aims to predict sentiment polarity on a target domain utilizing a classifier learned from a source domain. Most existing adversarial learning methods focus on aligning the global marginal distribution by fooling a domain discriminator, without taking category-specific decision boundaries into consideration, which can lead to the mismatch of category-level features. In this work, we propose an adversarial category alignment network (ACAN), which attempts to enhance category consistency between the source domain and the target domain. Specifically, we increase the discrepancy of two polarity classifiers to provide diverse views, locating ambiguous features near the decision boundaries. Then the generator learns to create better features away from the category boundaries by minimizing this discrepancy. Experimental results on benchmark datasets show that the proposed method can achieve state-of-the-art performance and produce more discriminative features.