Chaojun Xiao


Plug-and-Play Knowledge Injection for Pre-trained Language Models
Zhengyan Zhang | Zhiyuan Zeng | Yankai Lin | Huadong Wang | Deming Ye | Chaojun Xiao | Xu Han | Zhiyuan Liu | Peng Li | Maosong Sun | Jie Zhou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Injecting external knowledge can improve the performance of pre-trained language models (PLMs) on various downstream NLP tasks. However, massive retraining is required to deploy new knowledge injection methods or knowledge bases for downstream tasks. In this work, we are the first to study how to improve the flexibility and efficiency of knowledge injection by reusing existing downstream models. To this end, we explore a new paradigm, plug-and-play knowledge injection, where knowledge bases are injected into frozen existing downstream models by a knowledge plugin. Correspondingly, we propose a plug-and-play injection method, map-tuning, which trains a mapping of knowledge embeddings to enrich model inputs with mapped embeddings while keeping model parameters frozen. Experimental results on three knowledge-driven NLP tasks show that existing injection methods are not suitable for the new paradigm, while map-tuning effectively improves the performance of downstream models. Moreover, we show that a frozen downstream model can be well adapted to different domains with different mapping networks of domain knowledge. Our code and models are available at

Plug-and-Play Document Modules for Pre-trained Models
Chaojun Xiao | Zhengyan Zhang | Xu Han | Chi-Min Chan | Yankai Lin | Zhiyuan Liu | Xiangyang Li | Zhonghua Li | Zhao Cao | Maosong Sun
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large-scale pre-trained models (PTMs) have been widely used in document-oriented NLP tasks, such as question answering. However, the encoding-task coupling requirement results in the repeated encoding of the same documents for different tasks and queries, which is highly computationally inefficient. To this end, we aim to decouple document encoding from downstream tasks, and propose to represent each document as a plug-and-play document module, i.e., a document plugin, for PTMs (PlugD). By inserting document plugins into the backbone PTM for downstream tasks, we can encode a document one time to handle multiple tasks, which is more efficient than conventional encoding-task coupling methods that simultaneously encode documents and input queries using task-specific encoders. Extensive experiments on 8 datasets of 4 typical NLP tasks show that PlugD enables models to encode documents once and for all across different scenarios. In particular, PlugD can save 69% of computational costs while achieving comparable performance to state-of-the-art encoding-task coupling methods. Additionally, we show that PlugD can serve as an effective post-processing method to inject knowledge into task-specific models, improving model performance without any additional model training. Our code and checkpoints can be found in

Emergent Modularity in Pre-trained Transformers
Zhengyan Zhang | Zhiyuan Zeng | Yankai Lin | Chaojun Xiao | Xiaozhi Wang | Xu Han | Zhiyuan Liu | Ruobing Xie | Maosong Sun | Jie Zhou
Findings of the Association for Computational Linguistics: ACL 2023

This work examines the presence of modularity in pre-trained Transformers, a feature commonly found in human brains and thought to be vital for general intelligence. In analogy to human brains, we consider two main characteristics of modularity: (1) functional specialization of neurons: we evaluate whether each neuron is mainly specialized in a certain function, and find that the answer is yes. (2) function-based neuron grouping: we search for a structure that groups neurons into modules by function, such that each module works for its corresponding function. Given the enormous number of possible structures, we focus on Mixture-of-Experts as a promising candidate, which partitions neurons into experts and usually activates different experts for different inputs. Experimental results show that there are functional experts, in which the neurons specialized in a certain function are clustered. Moreover, perturbing the activations of functional experts significantly affects the corresponding function. Finally, we study how modularity emerges during pre-training, and find that the modular structure is stabilized at the early stage, which is faster than neuron stabilization. This suggests that the Transformer first constructs the modular structure and then learns fine-grained neuron functions. Our code and data are available at

Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules
Chaojun Xiao | Yuqi Luo | Wenbin Zhang | Pengle Zhang | Xu Han | Yankai Lin | Zhengyan Zhang | Ruobing Xie | Zhiyuan Liu | Maosong Sun | Jie Zhou
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) have achieved remarkable results on NLP tasks but at the expense of huge parameter sizes and the consequent computational costs. In this paper, we propose Variator, a parameter-efficient acceleration method that enhances computational efficiency through plug-and-play compression plugins. Compression plugins are designed to reduce the sequence length via compressing multiple hidden vectors into one and trained with original LLMs frozen. Different from traditional model acceleration methods, which compress LLMs to smaller sizes, Variator offers two distinct advantages: (1) In real-world applications, the plug-and-play nature of our compression plugins enables dynamic selection of different compression plugins with varying acceleration ratios based on the current workload. (2) The compression plugin comprises a few compact neural network layers with minimal parameters, significantly saving storage and memory overhead, particularly in scenarios with a growing number of tasks. We validate the effectiveness of Variator on seven datasets. Experimental results show that Variator can save 53% computational costs using only 0.9% additional parameters with a performance drop of less than 2%. Moreover, when the model scales to billions of parameters, Variator matches the strong performance of uncompressed LLMs. Our code and checkpoints will be released to facilitate future work.


LEVEN: A Large-Scale Chinese Legal Event Detection Dataset
Feng Yao | Chaojun Xiao | Xiaozhi Wang | Zhiyuan Liu | Lei Hou | Cunchao Tu | Juanzi Li | Yun Liu | Weixing Shen | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2022

Recognizing facts is the most fundamental step in making judgments, hence detecting events in legal documents is important to legal case analysis tasks. However, existing Legal Event Detection (LED) datasets cover only a limited set of event types and have limited annotated data, which restricts the development of LED methods and their downstream applications. To alleviate these issues, we present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types. Beyond charge-related events, LEVEN also covers general events, which are critical for legal case understanding but neglected in existing LED datasets. To our knowledge, LEVEN is the largest LED dataset and has dozens of times the data scale of others, which shall significantly promote the training and evaluation of LED methods. The results of extensive experiments indicate that LED is challenging and needs further effort. Moreover, we simply utilize legal events as side information to promote downstream applications. The method achieves average improvements of 2.2 points in precision in low-resource judgment prediction, and 1.5 points in mean average precision in unsupervised case retrieval, which suggests the fundamentality of LED. The source code and dataset can be obtained from


Denoising Relation Extraction from Document-level Distant Supervision
Chaojun Xiao | Yuan Yao | Ruobing Xie | Xu Han | Zhiyuan Liu | Maosong Sun | Fen Lin | Leyu Lin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Distant supervision (DS) has been widely adopted to generate auto-labeled data for sentence-level relation extraction (RE) and has achieved great results. However, the existing success of DS cannot be directly transferred to more challenging document-level relation extraction (DocRE), as the inevitable noise caused by DS may even be multiplied in documents and significantly harm the performance of RE. To alleviate this issue, we propose a novel pre-trained model for DocRE, which de-emphasizes noisy DS data via multiple pre-training tasks. The experimental results on the large-scale DocRE benchmark show that our model can capture useful information from noisy data and achieve promising results.

How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence
Haoxi Zhong | Chaojun Xiao | Cunchao Tu | Tianyang Zhang | Zhiyuan Liu | Maosong Sun
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Legal Artificial Intelligence (LegalAI) focuses on applying the technology of artificial intelligence, especially natural language processing, to benefit tasks in the legal domain. In recent years, LegalAI has rapidly drawn increasing attention from both AI researchers and legal professionals, as LegalAI is beneficial to the legal system for liberating legal professionals from a maze of paperwork. Legal professionals often think about how to solve tasks with rule-based and symbol-based methods, while NLP researchers concentrate more on data-driven and embedding methods. In this paper, we introduce the history, the current state, and the future directions of research in LegalAI. We illustrate the tasks from the perspectives of legal professionals and NLP researchers and show several representative applications in LegalAI. We conduct experiments and provide an in-depth analysis of the advantages and disadvantages of existing works to explore possible future directions. You can find the implementation of our work from

More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction
Xu Han | Tianyu Gao | Yankai Lin | Hao Peng | Yaoliang Yang | Chaojun Xiao | Zhiyuan Liu | Peng Li | Jie Zhou | Maosong Sun
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Relational facts are an important component of human knowledge, which are hidden in vast amounts of text. In order to extract these facts from text, people have been working on relation extraction (RE) for years. From early pattern matching to current neural networks, existing RE methods have achieved significant progress. Yet with the explosion of Web text and the emergence of new relations, human knowledge is increasing drastically, and we thus require “more” from RE: a more powerful RE system that can robustly utilize more data, efficiently learn more relations, easily handle more complicated context, and flexibly generalize to more open domains. In this paper, we look back at existing RE methods, analyze key challenges we are facing nowadays, and show promising directions towards more powerful RE. We hope our view can advance this field and inspire more efforts in the community.


Legal Judgment Prediction via Topological Learning
Haoxi Zhong | Zhipeng Guo | Cunchao Tu | Chaojun Xiao | Zhiyuan Liu | Maosong Sun
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Legal Judgment Prediction (LJP) aims to predict the judgment result based on the facts of a case and has become a promising application of artificial intelligence techniques in the legal field. In real-world scenarios, legal judgment usually consists of multiple subtasks, such as the decisions of applicable law articles, charges, fines, and the term of penalty. Moreover, there exist topological dependencies among these subtasks. While most existing works only focus on a specific subtask of judgment prediction and ignore the dependencies among subtasks, we formalize the dependencies among subtasks as a Directed Acyclic Graph (DAG) and propose a topological multi-task learning framework, TopJudge, which incorporates multiple subtasks and DAG dependencies into judgment prediction. We conduct experiments on several real-world large-scale datasets of criminal cases in the civil law system. Experimental results show that our model achieves consistent and significant improvements over baselines on all judgment prediction tasks. The source code can be obtained from