Heming Xia


2023

Enhancing Continual Relation Extraction via Classifier Decomposition
Heming Xia | Peiyi Wang | Tianyu Liu | Binghuai Lin | Yunbo Cao | Zhifang Sui
Findings of the Association for Computational Linguistics: ACL 2023

Continual relation extraction (CRE) models aim to handle emerging new relations while avoiding catastrophic forgetting of old ones in streaming data. Though previous CRE studies have shown improvements, most of them adopt only a vanilla training strategy when models first learn representations of new relations. In this work, we point out that this vanilla strategy leaves two typical biases after training: classifier bias and representation bias, which cause the knowledge the model previously learned to be overshadowed. To alleviate these biases, we propose a simple yet effective classifier decomposition framework that splits the last FFN layer into separate previous and current classifiers, so as to maintain previous knowledge and encourage the model to learn more robust representations at this training stage. Experimental results on two standard benchmarks show that our proposed framework consistently outperforms state-of-the-art CRE models, indicating that the importance of this first training stage to CRE models may be underestimated. Our code will be released upon acceptance.
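A minimal sketch of the classifier-decomposition idea as described above, in PyTorch. The module and attribute names are our own, and freezing the previous-relation head during this first training stage is an assumption we make for illustration, not a detail confirmed by the paper.

```python
import torch
import torch.nn as nn

class DecomposedClassifier(nn.Module):
    """Last FFN layer split into separate 'previous' and 'current' relation classifiers.

    Hypothetical sketch: keeping the previous head frozen while new relations are
    first learned is our assumption of how prior knowledge is preserved.
    """

    def __init__(self, hidden_size: int, num_prev_relations: int, num_new_relations: int):
        super().__init__()
        self.prev_head = nn.Linear(hidden_size, num_prev_relations)
        self.curr_head = nn.Linear(hidden_size, num_new_relations)
        # Keep previously learned relation weights intact during this stage.
        for p in self.prev_head.parameters():
            p.requires_grad = False

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Concatenate logits so the output covers all relations seen so far.
        return torch.cat([self.prev_head(hidden), self.curr_head(hidden)], dim=-1)


# Usage: relation logits for a batch of encoded sentence representations.
clf = DecomposedClassifier(hidden_size=768, num_prev_relations=32, num_new_relations=8)
logits = clf(torch.randn(4, 768))  # shape: (4, 40)
```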

ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories
Heming Xia | Qingxiu Dong | Lei Li | Jingjing Xu | Tianyu Liu | Ziwei Qin | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023

Recently, Large Language Models (LLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current LLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a human-annotated dataset specifically designed for zero- and few-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we benchmark the fundamental visual commonsense knowledge of both unimodal LLMs and VaLMs. Furthermore, we analyze the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
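A rough sketch of how a unimodal LLM could be probed in the zero-shot setting the abstract describes: rank answer candidates for a visual commonsense question by their log-likelihood under the model. The question/answer template and the rank-by-likelihood protocol are our assumptions for illustration, not ImageNetVC's exact evaluation procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(prompt: str, answer: str) -> float:
    """Approximate log-likelihood of the answer continuation given the prompt."""
    ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_ids = ids[0, prompt_len:]
    rows = torch.arange(prompt_len - 1, ids.shape[1] - 1)
    return log_probs[rows, answer_ids].sum().item()

# Hypothetical visual commonsense query about an ImageNet category.
question = "What is the typical color of a banana?"
candidates = ["yellow", "blue", "red"]
prediction = max(candidates, key=lambda a: answer_logprob(f"Question: {question} Answer:", a))
print(prediction)
```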

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
Heming Xia | Tao Ge | Peiyi Wang | Si-Qing Chen | Furu Wei | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023

We propose Speculative Decoding (SpecDec) to formally study, for the first time, how the idea of speculative execution can be exploited to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach can achieve around a 5x speedup for popular Transformer architectures with generation quality comparable to beam search decoding, refreshing the impression that the draft-then-verify paradigm yields only a 1.4x~2x speedup. Beyond the remarkable speedup, we also demonstrate three additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.
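A minimal sketch of the generic draft-then-verify loop that SpecDec builds on, with greedy verification. The `drafter` and `verifier` interfaces here are hypothetical placeholders; the actual Spec-Drafter and Spec-Verification designs are more sophisticated.

```python
import torch

def speculative_decode(verifier, drafter, prefix, max_len=128, k=5):
    """Generic draft-then-verify loop (greedy verification).

    Assumed interfaces (hypothetical): drafter(tokens, k) returns k draft token
    ids; verifier(tokens) returns a (len(tokens), vocab) tensor of next-token
    logits for every position, computed in a single parallel pass.
    """
    tokens = list(prefix)
    while len(tokens) < max_len:
        draft = drafter(tokens, k)                      # propose k tokens cheaply
        logits = verifier(tokens + draft)               # one parallel verifier pass
        # Verifier's greedy choice at each drafted position, plus one extra token.
        targets = logits[len(tokens) - 1:].argmax(dim=-1).tolist()
        n_accept = 0
        for d, t in zip(draft, targets):
            if d != t:
                break
            n_accept += 1
        # Keep the accepted draft tokens, then append the verifier's own token,
        # so at least one token is generated per iteration.
        tokens += draft[:n_accept] + [targets[n_accept]]
    return tokens
```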

Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization
Shoujie Tong | Heming Xia | Damai Dai | Runxin Xu | Tianyu Liu | Binghuai Lin | Yunbo Cao | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023

Pretrained language models have achieved remarkable success in natural language understanding. However, fine-tuning pretrained models on limited training data tends to overfit and thus degrade performance. This paper presents Bi-Drop, a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets dynamically generated by dropout. Bi-Drop performs sub-net estimation in an in-batch manner, so it overcomes the hysteresis in sub-net updating that affects previous methods relying on asynchronous sub-net estimation. Moreover, Bi-Drop needs only one mini-batch to estimate the sub-net, so it makes more efficient use of the training data. Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods. Furthermore, empirical results show that Bi-Drop exhibits excellent generalization ability and robustness in domain transfer, data imbalance, and low-resource scenarios.
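A rough sketch of the in-batch idea we take away from the abstract: sample two dropout sub-nets on the same mini-batch and update only the parameters the two gradients agree on. The sign-agreement criterion and the averaging below are our assumptions, not the paper's exact estimation rule.

```python
import torch

def bi_drop_style_step(model, loss_fn, batch, optimizer):
    """One hypothetical Bi-Drop-style update on a single mini-batch.

    Assumes model.train() so dropout is active, and that every parameter
    receives a gradient. The sub-net mask (gradient sign agreement) is an
    illustrative assumption.
    """
    grads = []
    for _ in range(2):                          # two dropout-sampled sub-nets, same batch
        optimizer.zero_grad()
        loss = loss_fn(model(batch["inputs"]), batch["labels"])
        loss.backward()
        grads.append([p.grad.detach().clone() for p in model.parameters()])

    optimizer.zero_grad()
    for p, g1, g2 in zip(model.parameters(), *grads):
        mask = (torch.sign(g1) == torch.sign(g2)).float()   # coordinates both sub-nets agree on
        p.grad = 0.5 * (g1 + g2) * mask                      # update only the estimated sub-net
    optimizer.step()
```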

2022

Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues
Qingxiu Dong | Ziwei Qin | Heming Xia | Tian Feng | Shoujie Tong | Haoran Meng | Lin Xu | Zhongyu Wei | Weidong Zhan | Baobao Chang | Sujian Li | Tianyu Liu | Zhifang Sui
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent work on vision-language cross-modal reasoning commonly adopts a binary or multi-choice classification formulation that takes as input a set of source image(s) and a textual query. In this work, we take a sober look at such an “unconditional” formulation, in the sense that no prior knowledge is specified with respect to the source image(s). Inspired by the designs of both visual commonsense reasoning and natural language inference tasks, we propose a new task termed “Premise-based Multi-modal Reasoning” (PMR), where a textual premise serves as the background presumption for each source image. The PMR dataset contains 15,360 manually annotated samples created through a multi-phase crowd-sourcing process. With selected high-quality movie screenshots and human-curated premise templates from 6 pre-defined categories, we ask crowd-sourcing workers to write one true hypothesis and three distractors (4 choices in total) given the premise and image, followed by a cross-check procedure.
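To make the task format concrete, here is a hypothetical PMR-style record. The field names, premise, and hypotheses below are our own illustration of the structure described above (image, premise, one true hypothesis plus three distractors), not actual dataset content or its real schema.

```python
# Hypothetical example record illustrating the premise-based 4-choice format.
pmr_example = {
    "image": "movie_screenshot_001.jpg",   # a selected movie screenshot
    "premise": "The two people at the table are strangers to each other.",
    "choices": [                            # one true hypothesis + three distractors
        "The man introduces himself before sitting down.",
        "The man reminds the woman of their wedding anniversary.",
        "The woman hands the man his usual order without asking.",
        "The man scolds the woman for forgetting his birthday.",
    ],
    "label": 0,                             # index of the true hypothesis
}
```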