Ruiyi Zhang


pdf bib
Few-Shot Class-Incremental Learning for Named Entity Recognition
Rui Wang | Tong Yu | Handong Zhao | Sungchul Kim | Subrata Mitra | Ruiyi Zhang | Ricardo Henao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Previous work of class-incremental learning for Named Entity Recognition (NER) relies on the assumption that there exists abundance of labeled data for the training of new classes. In this work, we study a more challenging but practical problem, i.e., few-shot class-incremental learning for NER, where an NER model is trained with only few labeled samples of the new classes, without forgetting knowledge of the old ones. To alleviate the problem of catastrophic forgetting in few-shot class-incremental learning, we reconstruct synthetic training data of the old classes using the trained NER model, augmenting the training of new classes. We further develop a framework that distills from the existing model with both synthetic data, and real data from the current training set. Experimental results show that our approach achieves significant improvements over existing baselines.

pdf bib
Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns
Zihan Wang | Jiuxiang Gu | Jason Kuen | Handong Zhao | Vlad Morariu | Ruiyi Zhang | Ani Nenkova | Tong Sun | Jingbo Shang
Findings of the Association for Computational Linguistics: ACL 2022

We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns—during fine-tuning—different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens—for each task and model layer—and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive and sometimes better performance against fixed sparse attention patterns that require resource-intensive pre-training.


pdf bib
Graphine: A Dataset for Graph-aware Terminology Definition Generation
Zequn Liu | Shukai Wang | Yiyang Gu | Ruiyi Zhang | Ming Zhang | Sheng Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Precisely defining the terminology is the first step in scientific communication. Developing neural text generation models for definition generation can circumvent the labor-intensity curation, further accelerating scientific discovery. Unfortunately, the lack of large-scale terminology definition dataset hinders the process toward definition generation. In this paper, we present a large-scale terminology definition dataset Graphine covering 2,010,648 terminology definition pairs, spanning 227 biomedical subdisciplines. Terminologies in each subdiscipline further form a directed acyclic graph, opening up new avenues for developing graph-aware text generation models. We then proposed a novel graph-aware definition generation model Graphex that integrates transformer with graph neural network. Our model outperforms existing text generation models by exploiting the graph structure of terminologies. We further demonstrated how Graphine can be used to evaluate pretrained language models, compare graph representation learning methods and predict sentence granularity. We envision Graphine to be a unique resource for definition generation and many other NLP tasks in biomedicine.


pdf bib
Semantic Matching for Sequence-to-Sequence Learning
Ruiyi Zhang | Changyou Chen | Xinyuan Zhang | Ke Bai | Lawrence Carin
Findings of the Association for Computational Linguistics: EMNLP 2020

In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on the Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words needed to be matched), we further exploit prior knowledge to perform partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements over sequence-to-sequence tasks.

pdf bib
Improving Adversarial Text Generation by Modeling the Distant Future
Ruiyi Zhang | Changyou Chen | Zhe Gan | Wenlin Wang | Dinghan Shen | Guoyin Wang | Zheng Wen | Lawrence Carin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation. Further, automatically generating words with similar semantics is challenging, and hand-crafted linguistic rules are difficult to apply. We consider a text planning scheme and present a model-based imitation-learning approach to alleviate the aforementioned issues. Specifically, we propose a novel guider network to focus on the generative process over a longer horizon, which can assist next-word prediction and provide intermediate rewards for generator optimization. Extensive experiments demonstrate that the proposed method leads to improved performance.

pdf bib
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Bang An | Jie Lyu | Zhenyi Wang | Chunyuan Li | Changwei Hu | Fei Tan | Ruiyi Zhang | Yifan Hu | Changyou Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The neural attention mechanism plays an important role in many natural language processing applications. In particular, multi-head attention extends single-head attention by allowing a model to jointly attend information from different perspectives. However, without explicit constraining, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, thus limiting the model’s representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on the recently developed particle-optimization sampling techniques, we propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention and consequently strengthens model’s expressiveness. Remarkably, our Bayesian interpretation provides theoretical inspirations on the not-well-understood questions: why and how one uses multi-head attention. Extensive experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity, leading to more informative representations with consistent performance improvement on multiple tasks.

pdf bib
Improving Text Generation with Student-Forcing Optimal Transport
Jianqiao Li | Chunyuan Li | Guoyin Wang | Hao Fu | Yuhchen Lin | Liqun Chen | Yizhe Zhang | Chenyang Tao | Ruiyi Zhang | Wenlin Wang | Dinghan Shen | Qian Yang | Lawrence Carin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. We examine the necessity of adding Student-Forcing scheme during training with an imitation learning interpretation. An extension is further proposed to improve the OT learning for long sequences, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.


pdf bib
Topic-Guided Variational Auto-Encoder for Text Generation
Wenlin Wang | Zhe Gan | Hongteng Xu | Ruiyi Zhang | Guoyin Wang | Dinghan Shen | Changyou Chen | Lawrence Carin
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a topic-guided variational auto-encoder (TGVAE) model for text generation. Distinct from existing variational auto-encoder (VAE) based approaches, which assume a simple Gaussian prior for latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides a guidance to generate sentences under the topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during the model inference. Experimental results show that our TGVAE outperforms its competitors on both unconditional and conditional text generation, which can also generate semantically-meaningful sentences with various topics.