Previous work of class-incremental learning for Named Entity Recognition (NER) relies on the assumption that there exists abundance of labeled data for the training of new classes. In this work, we study a more challenging but practical problem, i.e., few-shot class-incremental learning for NER, where an NER model is trained with only few labeled samples of the new classes, without forgetting knowledge of the old ones. To alleviate the problem of catastrophic forgetting in few-shot class-incremental learning, we reconstruct synthetic training data of the old classes using the trained NER model, augmenting the training of new classes. We further develop a framework that distills from the existing model with both synthetic data, and real data from the current training set. Experimental results show that our approach achieves significant improvements over existing baselines.
We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns—during fine-tuning—different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens—for each task and model layer—and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive and sometimes better performance against fixed sparse attention patterns that require resource-intensive pre-training.
Supervised training of existing deep learning models for sequence labeling relies on large scale labeled datasets. Such datasets are generally created with crowd-source labeling. However, crowd-source labeling for tasks of sequence labeling can be expensive and time-consuming. Further, crowd-source labeling by external annotators may not be appropriate for data that contains user private information. Considering the above limitations of crowd-source labeling, we study interactive sequence labeling that allows training directly with the user feedback, which alleviates the annotation cost and maintains the user privacy. We identify two bias, namely, context bias and feedback bias, by formulating interactive sequence labeling via a Structural Causal Model (SCM). To alleviate the context and feedback bias based on the SCM, we identify the frequent context tokens as confounders in the backdoor adjustment and further propose an entropy-based modulation that is inspired by information theory. entities more sample-efficiently. With extensive experiments, we validate that our approach can effectively alleviate the biases and our models can be efficiently learnt with the user feedback.
Demonstration-based learning has shown great potential in stimulating pretrained language models’ ability under limited data scenario. Simply augmenting the input with some demonstrations can significantly improve performance on few-shot NER. However, why such demonstrations are beneficial for the learning process remains unclear since there is no explicit alignment between the demonstrations and the predictions. In this paper, we design pathological demonstrations by gradually removing intuitively useful information from the standard ones to take a deep dive of the robustness of demonstration-based sequence labeling and show that (1) demonstrations composed of random tokens still make the model a better few-shot learner; (2) the length of random demonstrations and the relevance of random tokens are the main factors affecting the performance; (3) demonstrations increase the confidence of model predictions on captured superficial patterns. We have publicly released our code at https://github.com/SALT-NLP/RobustDemo.
Precisely defining the terminology is the first step in scientific communication. Developing neural text generation models for definition generation can circumvent the labor-intensity curation, further accelerating scientific discovery. Unfortunately, the lack of large-scale terminology definition dataset hinders the process toward definition generation. In this paper, we present a large-scale terminology definition dataset Graphine covering 2,010,648 terminology definition pairs, spanning 227 biomedical subdisciplines. Terminologies in each subdiscipline further form a directed acyclic graph, opening up new avenues for developing graph-aware text generation models. We then proposed a novel graph-aware definition generation model Graphex that integrates transformer with graph neural network. Our model outperforms existing text generation models by exploiting the graph structure of terminologies. We further demonstrated how Graphine can be used to evaluate pretrained language models, compare graph representation learning methods and predict sentence granularity. We envision Graphine to be a unique resource for definition generation and many other NLP tasks in biomedicine.
In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on the Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words needed to be matched), we further exploit prior knowledge to perform partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements over sequence-to-sequence tasks.
Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation. Further, automatically generating words with similar semantics is challenging, and hand-crafted linguistic rules are difficult to apply. We consider a text planning scheme and present a model-based imitation-learning approach to alleviate the aforementioned issues. Specifically, we propose a novel guider network to focus on the generative process over a longer horizon, which can assist next-word prediction and provide intermediate rewards for generator optimization. Extensive experiments demonstrate that the proposed method leads to improved performance.
The neural attention mechanism plays an important role in many natural language processing applications. In particular, multi-head attention extends single-head attention by allowing a model to jointly attend information from different perspectives. However, without explicit constraining, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, thus limiting the model’s representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on the recently developed particle-optimization sampling techniques, we propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention and consequently strengthens model’s expressiveness. Remarkably, our Bayesian interpretation provides theoretical inspirations on the not-well-understood questions: why and how one uses multi-head attention. Extensive experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity, leading to more informative representations with consistent performance improvement on multiple tasks.
Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. We examine the necessity of adding Student-Forcing scheme during training with an imitation learning interpretation. An extension is further proposed to improve the OT learning for long sequences, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
We propose a topic-guided variational auto-encoder (TGVAE) model for text generation. Distinct from existing variational auto-encoder (VAE) based approaches, which assume a simple Gaussian prior for latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides a guidance to generate sentences under the topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during the model inference. Experimental results show that our TGVAE outperforms its competitors on both unconditional and conditional text generation, which can also generate semantically-meaningful sentences with various topics.