Yaqing Wang


2023

pdf bib
HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference
Haoyu Wang | Yaqing Wang | Tianci Liu | Tuo Zhao | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2023

Pre-trained language models (LMs) have brought remarkable performance on numerous NLP tasks. However, they require significant resources and entail high computational costs for inference, making them challenging to deploy in real-world and real-time systems. Existing early exiting methods aim to reduce computational complexity by selecting the layer at which to exit, but suffer from the limitation that they have to sequentially traverse through all layers prior to the selected exit layer, which lacks flexibility and degrades their performance. To solve this problem, we propose a homotopic and adaptive layer skipping fine-tuning method named HadSkip. HadSkip adaptively selects the layers to skip based on a predefined budget. Specifically, we introduce a learnable gate before each layer of the LM to determine whether the current layer should be skipped. To tackle various challenges in training such as discrete gates and the budget constraint, we propose a fine-grained initialization strategy and homotopic optimization strategy. We conduct extensive experiments on the GLUE benchmark, and experimental results demonstrate the proposed HadSkip outperforms all state-of-the-art baselines significantly.

pdf bib
Macedon: Minimizing Representation Coding Rate Reduction for Cross-Lingual Natural Language Understanding
Haoyu Wang | Yaqing Wang | Huaxiu Yao | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2023

Cross-lingual natural language understanding(NLU) is one of the fundamental tasks of NLP. The goal is to learn a model which can generalize well on both high-resource and low-resource language data. Recent pre-trained multilingual language models, e.g., multilingual BERT, XLM, have shown impressive performance on cross-lingual NLU tasks. However, such promising results request the use of sufficient training data, which is a difficult condition to satisfy for low-resource language. When the data is limited in those low resource languages, the accuracy of existing models will drop. In light of this challenge, we investigate the important task of how to train the cross-lingual model with abundant high-source language data and limited low-resource language data. Existing methods typically learn language-agnostic representation via adversarial training and mutual information estimation. Existing approaches may suffer When data is very limited (e.g., low-resource language) because it is challenging to estimate data distribution accurately. To tackle this issue, we propose a conceptually innovative approach to remove language-associated information via minimizing representation coding rate reduction(Macedon). Specifically, Macedon avoids using extra codes to encode language-related information, which is measured by the rate-distortion function. To validate the effectiveness of Macedon, we conduct extensive experiments on three tasks, including paraphrase identification, natural language inference, and query advertisement matching. The experiment results show that the proposed Macedon outperforms state-of-the-art cross-lingual NLU approaches.

pdf bib
Hierarchical Pretraining on Multimodal Electronic Health Records
Xiaochen Wang | Junyu Luo | Jiaqi Wang | Ziyi Yin | Suhan Cui | Yuan Zhong | Yaqing Wang | Fenglong Ma
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Pretraining has proven to be a powerful technique in natural language processing (NLP), exhibiting remarkable success in various NLP downstream tasks. However, in the medical domain, existing pretrained models on electronic health records (EHR) fail to capture the hierarchical nature of EHR data, limiting their generalization capability across diverse downstream tasks using a single pretrained model. To tackle this challenge, this paper introduces a novel, general, and unified pretraining framework called MedHMP, specifically designed for hierarchically multimodal EHR data. The effectiveness of the proposed MedHMP is demonstrated through experimental results on eight downstream tasks spanning three levels. Comparisons against eighteen baselines further highlight the efficacy of our approach.

2022

pdf bib
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
Yaqing Wang | Sahaj Agarwal | Subhabrata Mukherjee | Xiaodong Liu | Jing Gao | Ahmed Hassan Awadallah | Jianfeng Gao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task resulting in increased cost for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced where small trainable components are injected in the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the same computational cost and the number of tunable parameters as the underlying PEFT method. By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.

pdf bib
Simplified Graph Learning for Inductive Short Text Classification
Kaixin Zheng | Yaqing Wang | Quanming Yao | Dejing Dou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Short text classification (STC) is hard as short texts lack context information and labeled data is not enough. Graph neural networks obtain the state-of-the-art on STC since they can merge various auxiliary information via the message passing framework. However, existing works conduct transductive learning, which requires retraining to accommodate new samples and takes large memory. In this paper, we present SimpleSTC which handles inductive STC problem but only leverages words. We construct word graph from an external large corpus to compensate for the lack of semantic information, and learn text graph to handle the lack of labeled data. Results show that SimpleSTC obtains state-of-the-art performance with lower memory consumption and faster inference speed.

pdf bib
RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning
Yaqing Wang | Xin Tian | Haoyi Xiong | Yueyang Li | Zeyu Chen | Sheng Guo | Dejing Dou
Findings of the Association for Computational Linguistics: NAACL 2022

Pre-trained language models (PLMs) can provide a good starting point for downstream applications. However, it is difficult to generalize PLMs to new tasks given a few labeled samples. In this work, we show that Relation Graph augmented Learning (RGL) can improve the performance of few-shot natural language understanding tasks. During learning, RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resultant node classification and link prediction problems on the relation graph. In this way, RGL fully exploits the limited supervised information, which can boost the tuning effectiveness. Extensive experimental results show that RGL consistently improves the performance of prompt-based tuning strategies.

pdf bib
LiST: Lite Prompted Self-training Makes Parameter-efficient Few-shot Learners
Yaqing Wang | Subhabrata Mukherjee | Xiaodong Liu | Jing Gao | Ahmed Awadallah | Jianfeng Gao
Findings of the Association for Computational Linguistics: NAACL 2022

We present a new method LiST for efficient fine-tuning of large pre-trained language models (PLMs) in few-shot learning settings. LiST improves over recent methods that adopt prompt-based fine-tuning (FN) using two key techniques. The first is the use of self-training to leverage large amounts of unlabeled data for prompt-based FN in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. Traditionally, self-training is expensive as it requires updating all the model parameters repetitively. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. Our experiments show that LiST can effectively leverage unlabeled data to improve the model performance for few-shot learning. Additionally, the finetuning process is efficient as it only updates a small percentage of the parameters and the overall model footprint is reduced since several tasks can share a common PLM encoder as backbone. We present a comprehensive study on six NLU tasks to validate the effectiveness of LiST. The results show that LiST improves by 35% over classic fine-tuning methods and 6% over prompt-based FN with 96% reduction in number of trainable parameters when fine-tuned with no more than 30 labeled examples from each task. With only 14M tunable parameters, LiST outperforms GPT-3 in-context learning by 33% on few-shot NLU tasks

2021

pdf bib
Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification
Yaqing Wang | Song Wang | Quanming Yao | Dejing Dou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Short text classification is a fundamental task in natural language processing. It is hard due to the lack of context information and labeled data in practice. In this paper, we propose a new method called SHINE, which is based on graph neural network (GNN), for short text classification. First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs which introduce more semantic and syntactic information. Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts. Thus, comparing with existing GNN-based methods, SHINE can better exploit interactions between nodes of the same types and capture similarities between short texts. Extensive experiments on various benchmark short text datasets show that SHINE consistently outperforms state-of-the-art methods, especially with fewer labels.

pdf bib
Knowledge-Guided Paraphrase Identification
Haoyu Wang | Fenglong Ma | Yaqing Wang | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2021

Paraphrase identification (PI), a fundamental task in natural language processing, is to identify whether two sentences express the same or similar meaning, which is a binary classification problem. Recently, BERT-like pre-trained language models have been a popular choice for the frameworks of various PI models, but almost all existing methods consider general domain text. When these approaches are applied to a specific domain, existing models cannot make accurate predictions due to the lack of professional knowledge. In light of this challenge, we propose a novel framework, namely , which can leverage the external unstructured Wikipedia knowledge to accurately identify paraphrases. We propose to mine outline knowledge of concepts related to given sentences from Wikipedia via BM25 model. After retrieving related outline knowledge, makes predictions based on both the semantic information of two sentences and the outline knowledge. Besides, we propose a gating mechanism to aggregate the semantic information-based prediction and the knowledge-based prediction. Extensive experiments are conducted on two public datasets: PARADE (a computer science domain dataset) and clinicalSTS2019 (a biomedical domain dataset). The results show that the proposed outperforms state-of-the-art methods.

pdf bib
Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework
Yaqing Wang | Haoda Chu | Chao Zhang | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2021

In this work, we study the problem of named entity recognition (NER) in a low resource scenario, focusing on few-shot and zero-shot settings. Built upon large-scale pre-trained language models, we propose a novel NER framework, namely SpanNER, which learns from natural language supervision and enables the identification of never-seen entity classes without using in-domain labeled data. We perform extensive experiments on 5 benchmark datasets and evaluate the proposed method in the few-shot learning, domain transfer and zero-shot learning settings. The experimental results show that the proposed method can bring 10%, 23% and 26% improvements in average over the best baselines in few-shot learning, domain transfer and zero-shot learning settings respectively.