Fei Wu


Fast Nearest Neighbor Machine Translation
Yuxian Meng | Xiaoya Li | Xiayu Zheng | Fei Wu | Xiaofei Sun | Tianwei Zhang | Jiwei Li
Findings of the Association for Computational Linguistics: ACL 2022

Though nearest neighbor Machine Translation (kNN-MT) (CITATION) has proved to introduce significant performance boosts over standard neural MT systems, it is prohibitively slow since it uses the entire reference corpus as the datastore for the nearest neighbor search. This means each step for each beam in the beam search has to search over the entire reference corpus. kNN-MT is thus two orders of magnitude slower than vanilla MT models, making it hard to apply to real-world applications, especially online services. In this work, we propose Fast kNN-MT to address this issue. Fast kNN-MT constructs a significantly smaller datastore for the nearest neighbor search: for each word in a source sentence, Fast kNN-MT first selects its nearest token-level neighbors, which are limited to tokens that are the same as the query token. Then at each decoding step, in contrast to using the entire corpus as the datastore, the search space is limited to target tokens corresponding to the previously selected reference source tokens. This strategy avoids searching through the whole datastore for nearest neighbors and drastically improves decoding efficiency. Without loss of performance, Fast kNN-MT is two orders of magnitude faster than kNN-MT, and is only two times slower than the standard NMT model. Fast kNN-MT enables the practical use of kNN-MT systems in real-world MT applications. The code is available at https://github.com/ShannonAI/fast-knn-nmt.
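
The two-stage restriction described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, and the assumption that each reference source token already comes with one aligned target token are all hypothetical, chosen only to show how the search space shrinks.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_reference_tokens(query_tokens, query_vecs, reference, k=1):
    """For each source token, keep its k nearest reference source tokens,
    searching only among reference entries with the same token."""
    selected = []
    for token, vec in zip(query_tokens, query_vecs):
        same_token = [entry for entry in reference if entry[0] == token]
        same_token.sort(key=lambda entry: l2_distance(vec, entry[1]))
        selected.extend(same_token[:k])
    return selected

def build_small_datastore(selected):
    """Restrict the decoding-time search space to the target tokens
    aligned with the selected reference source tokens."""
    return [target for _, _, target in selected]

# Toy reference corpus (hypothetical):
# (source token, source vector, aligned target token).
reference = [
    ("cat", [0.0, 1.0], "Katze"),
    ("cat", [5.0, 5.0], "Kater"),
    ("dog", [1.0, 0.0], "Hund"),
    ("bird", [0.2, 0.2], "Vogel"),
]
selected = select_reference_tokens(
    ["cat", "dog"], [[0.1, 0.9], [1.0, 0.1]], reference, k=1
)
datastore = build_small_datastore(selected)
print(datastore)
```

In this toy setup the decoder would search over two target tokens instead of the four in the full reference corpus.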

Paraphrase Generation as Unsupervised Machine Translation
Xiaofei Sun | Yufei Tian | Yuxian Meng | Nanyun Peng | Fei Wu | Jiwei Li | Chun Fan
Proceedings of the 29th International Conference on Computational Linguistics

In this paper, we propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT) based on the assumption that there must be pairs of sentences expressing the same meaning in a large-scale unlabeled monolingual corpus. The proposed paradigm first splits a large unlabeled corpus into multiple clusters, and trains multiple UMT models using pairs of these clusters. Then based on the paraphrase pairs produced by these UMT models, a unified surrogate model can be trained to serve as the final model to generate paraphrases, which can be directly used for testing in the unsupervised setup, or be finetuned on labeled datasets in the supervised setup. The proposed method offers merits over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows humans to intervene in the model so that more diverse paraphrases can be generated using different filtering criteria. Extensive experiments on existing paraphrase datasets for both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.

Triggerless Backdoor Attack for NLP Tasks with Clean Labels
Leilei Gan | Jiwei Li | Tianwei Zhang | Xiaoya Li | Yuxian Meng | Fei Wu | Yi Yang | Shangwei Guo | Chun Fan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Backdoor attacks pose a new threat to NLP models. A standard strategy to construct poisoned data in backdoor attacks is to insert triggers (e.g., rare words) into selected sentences and alter the original label to a target label. This strategy comes with a severe flaw of being easily detected from both the trigger and the label perspectives: the trigger injected, which is usually a rare word, leads to an abnormal natural language expression, and thus can be easily detected by a defense model; the changed target label leads the example to be mistakenly labeled, and thus can be easily detected by manual inspections. To deal with this issue, in this paper, we propose a new strategy to perform textual backdoor attacks that does not require an external trigger and whose poisoned samples are correctly labeled. The core idea of the proposed strategy is to construct clean-labeled examples, whose labels are correct but can lead to test label changes when fused with the training set. To generate poisoned clean-labeled examples, we propose a sentence generation model based on the genetic algorithm to cater to the non-differentiable characteristic of text data. Extensive experiments demonstrate that the proposed attacking strategy is not only effective, but more importantly, hard to defend against due to its triggerless and clean-labeled nature. Our work marks the first step towards developing triggerless attacking strategies in NLP.

Sentence Similarity Based on Contexts
Xiaofei Sun | Yuxian Meng | Xiang Ao | Fei Wu | Tianwei Zhang | Jiwei Li | Chun Fan
Transactions of the Association for Computational Linguistics, Volume 10

Existing methods to measure sentence similarity are faced with two challenges: (1) labeled datasets are usually limited in size, making them insufficient to train supervised neural models; and (2) there is a training-test gap for unsupervised language modeling (LM) based models to compute semantic scores between sentences, since sentence-level semantics are not explicitly modeled at training. This results in inferior performance on this task. In this work, we propose a new framework to address these two issues. The proposed framework is based on the core idea that the meaning of a sentence should be defined by its contexts, and that sentence similarity can be measured by comparing the probabilities of generating two sentences given the same context. The proposed framework is able to generate a high-quality, large-scale dataset with semantic similarity scores between two sentences in an unsupervised manner, with which the train-test gap can be largely bridged. Extensive experiments show that the proposed framework achieves significant performance boosts over existing baselines under both the supervised and unsupervised settings across different datasets.

Dependency Parsing as MRC-based Span-Span Prediction
Leilei Gan | Yuxian Meng | Kun Kuang | Xiaofei Sun | Chun Fan | Fei Wu | Jiwei Li
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Higher-order methods for dependency parsing can partially but not fully address the issue that edges in dependency trees should be constructed at the text span/subtree level rather than word level. In this paper, we propose a new method for dependency parsing to address this issue. The proposed method constructs dependency trees by directly modeling span-span (in other words, subtree-subtree) relations. It consists of two modules: the text span proposal module, which proposes candidate text spans, each of which represents a subtree in the dependency tree denoted by (root, start, end); and the span linking module, which constructs links between proposed spans. We use the machine reading comprehension (MRC) framework as the backbone to formalize the span linking module, where one span is used as a query to extract the text span/subtree it should be linked to. The proposed method has the following merits: (1) it addresses the fundamental problem that edges in a dependency tree should be constructed between subtrees; (2) the MRC framework allows the method to retrieve missing spans in the span proposal stage, which leads to higher recall for eligible spans. Extensive experiments on the PTB, CTB and Universal Dependencies (UD) benchmarks demonstrate the effectiveness of the proposed method. The code is available at https://github.com/ShannonAI/mrc-for-dependency-parsing

End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding
Mengze Li | Tianbao Wang | Haoyu Zhang | Shengyu Zhang | Zhou Zhao | Jiaxu Miao | Wenqiao Zhang | Wenming Tan | Jin Wang | Peng Wang | Shiliang Pu | Fei Wu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. In spite of the great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are irrelevant to either the language query or the labeled frame. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we design an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its key module, the information tree, can eliminate the interference of irrelevant frames based on branch search and branch cropping techniques. In addition, several self-supervised tasks are proposed based on the information tree to improve the representation learning under insufficient labeling. Experiments on the benchmark dataset demonstrate the effectiveness of our model.


ConRPG: Paraphrase Generation using Contexts as Regularizer
Yuxian Meng | Xiang Ao | Qing He | Xiaofei Sun | Qinghong Han | Fei Wu | Chun Fan | Jiwei Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A long-standing issue with paraphrase generation is the lack of reliable supervision signals. In this paper, we propose a new unsupervised paradigm for paraphrase generation based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same. Inspired by this fundamental idea, we propose a pipelined system which consists of paraphrase candidate generation based on contextual language models, candidate filtering using scoring functions, and paraphrase model training based on the selected candidates. The proposed paradigm offers merits over existing paraphrase generation methods: (1) using the context regularizer on meanings, the model is able to generate massive amounts of high-quality paraphrase pairs; (2) the combination of the huge amount of paraphrase candidates and further diversity-promoting filtering yields paraphrases with more lexical and syntactic diversity; and (3) using human-interpretable scoring functions to select paraphrase pairs from candidates, the proposed framework provides a channel for developers to intervene with the data generation process, leading to a more controllable model. Experimental results across different tasks and datasets demonstrate that the proposed paradigm significantly outperforms existing paraphrase approaches in both supervised and unsupervised setups.

Layer-wise Model Pruning based on Mutual Information
Chun Fan | Jiwei Li | Tianwei Zhang | Xiang Ao | Fei Wu | Yuxian Meng | Xiaofei Sun
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Inspired by mutual information (MI) based feature selection in SVMs and logistic regression, in this paper, we propose MI-based layer-wise pruning: for each layer of a multi-layer neural network, neurons with higher values of MI with respect to preserved neurons in the upper layer are preserved. Starting from the top softmax layer, layer-wise pruning proceeds in a top-down fashion until reaching the bottom word embedding layer. The proposed pruning strategy offers merits over weight-based pruning techniques: (1) it avoids irregular memory access since representations and matrices can be squeezed into their smaller but dense counterparts, leading to greater speedup; (2) in a manner of top-down pruning, the proposed method operates from a more global perspective based on training signals in the top layer, and prunes each layer by propagating the effect of global signals through layers, leading to better performances at the same sparsity level. Extensive experiments show that at the same sparsity level, the proposed strategy offers both greater speedup and higher performances than weight-based pruning methods (e.g., magnitude pruning, movement pruning).

kFolden: k-Fold Ensemble for Out-Of-Distribution Detection
Xiaoya Li | Jiwei Li | Xiaofei Sun | Chun Fan | Tianwei Zhang | Fei Wu | Yuxian Meng | Jun Zhang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Out-of-Distribution (OOD) detection is an important problem in natural language processing (NLP). In this work, we propose a simple yet effective framework, kFolden, which mimics the behaviors of OOD detection during training without the use of any external data. For a task with k training labels, kFolden induces k sub-models, each of which is trained on a subset with k-1 categories, with the left-out category masked as unknown to the sub-model. When an unknown label is exposed to the sub-model during training, the model is encouraged to learn to attribute the probability equally to the seen k-1 labels for the unknown label, enabling this framework to simultaneously resolve in- and out-of-distribution examples in a natural way via OOD simulations. Taking text classification as an archetype, we develop benchmarks for OOD detection using existing text classification datasets. By conducting comprehensive comparisons and analyses on the developed benchmarks, we demonstrate the superiority of kFolden against current methods in terms of improving OOD detection performances while maintaining improved in-domain classification accuracy.
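
The leave-one-label-out construction in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the helper names `kfolden_subsets` and `uniform_target` and the toy examples are hypothetical, and the sub-model training itself is omitted.

```python
def kfolden_subsets(examples):
    """Build one training subset per label, each omitting that label.

    examples: list of (text, label) pairs.
    Returns {held_out_label: examples covering the other k-1 labels}.
    """
    labels = sorted({label for _, label in examples})
    return {
        held_out: [(text, label) for text, label in examples if label != held_out]
        for held_out in labels
    }

def uniform_target(seen_labels):
    """Target distribution for a held-out example: probability spread
    equally over the k-1 labels the sub-model has seen."""
    p = 1.0 / len(seen_labels)
    return {label: p for label in seen_labels}

# Toy data (hypothetical) with k = 3 labels.
examples = [
    ("the match ended in a draw", "sports"),
    ("shares fell sharply today", "business"),
    ("a new telescope was launched", "science"),
    ("the striker scored twice", "sports"),
]
subsets = kfolden_subsets(examples)
target = uniform_target(["business", "science"])
print(len(subsets), target)
```

Each of the three subsets would train one sub-model, and the sports examples would serve as simulated OOD inputs for the sub-model that never saw the sports label.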

BertGCN: Transductive Text Classification by Combining GNN and BERT
Yuxiao Lin | Yuxian Meng | Xiaofei Sun | Qinghong Han | Kun Kuang | Jiwei Li | Fei Wu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Zijun Sun | Xiaoya Li | Xiaofei Sun | Yuxian Meng | Xiang Ao | Qing He | Fei Wu | Jiwei Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The proposed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition and word segmentation.

CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction
Tao Chen | Haizhou Shi | Siliang Tang | Zhigang Chen | Fei Wu | Yueting Zhuang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The journey of reducing noise from distant supervision (DS) generated training data began when DS was first introduced into the relation extraction (RE) task. For the past decade, researchers have applied the multi-instance learning (MIL) framework to find the most reliable feature from a bag of sentences. Although the pattern of MIL bags can greatly reduce DS noise, it fails to represent many other useful sentence features in the datasets. In many cases, these sentence features can only be acquired by extra sentence-level human annotation with heavy costs. Therefore, the performance of distantly supervised RE models is bounded. In this paper, we go beyond the typical MIL framework and propose a novel contrastive instance learning (CIL) framework. Specifically, we regard the initial MIL as the relational triple encoder and constrain positive pairs against negative pairs for each instance. Experiments demonstrate the effectiveness of our proposed framework, with significant improvements over the previous methods on NYT10, GDS and KBP.


Dice Loss for Data-imbalanced NLP Tasks
Xiaoya Li | Xiaofei Sun | Yuxian Meng | Junjun Liang | Fei Wu | Jiwei Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Many NLP tasks such as tagging and machine reading comprehension are faced with the severe data imbalance issue: negative examples significantly outnumber positive examples, and the huge number of easy-negative examples overwhelms the training. The most commonly used cross entropy (CE) criterion is actually an accuracy-oriented objective, and thus creates a discrepancy between training and test: at training time, each training instance contributes equally to the objective function, while at test time the F1 score is more concerned with positive examples. In this paper, we propose to use dice loss in replacement of the standard cross-entropy objective for data-imbalanced NLP tasks. Dice loss is based on the Sørensen-Dice coefficient or Tversky index, which attaches similar importance to false positives and false negatives, and is more immune to the data-imbalance issue. To further alleviate the dominating influence from easy-negative examples in training, we propose to associate training examples with dynamically adjusted weights to deemphasize easy-negative examples. Theoretical analysis shows that this strategy narrows down the gap between the F1 score in evaluation and the dice loss in training. With the proposed training objective, we observe a significant performance boost on a wide range of data-imbalanced NLP tasks. Notably, we are able to achieve SOTA results on CTB5, CTB6 and UD1.4 for the part of speech tagging task; SOTA results on CoNLL03, OntoNotes5.0, MSRA and OntoNotes4.0 for the named entity recognition task; along with competitive results on the tasks of machine reading comprehension and paraphrase identification.
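
To make the imbalance argument concrete, here is a minimal sketch of the Sørensen-Dice coefficient on binary labels, together with a generic smoothed "soft" dice loss on probabilities. The soft variant and its `smooth` parameter are an assumption for illustration and are not claimed to match the exact loss or the dynamic weighting proposed in the paper.

```python
def dice_coefficient(pred, gold):
    """Sorensen-Dice coefficient 2*TP / (2*TP + FP + FN) on 0/1 labels.
    True negatives do not appear in the formula."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom

def soft_dice_loss(probs, gold, smooth=1.0):
    """Generic smoothed dice loss on predicted probabilities
    (an illustrative variant, not the paper's exact objective)."""
    intersection = sum(p * g for p, g in zip(probs, gold))
    total = sum(probs) + sum(gold)
    return 1.0 - (2 * intersection + smooth) / (total + smooth)

# One true positive, one false positive, one false negative.
few_negatives = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
# Same errors, but with many more correctly rejected negatives.
many_negatives = dice_coefficient([1, 1] + [0] * 98, [1, 0, 1] + [0] * 97)
print(few_negatives, many_negatives)
```

Both calls give the same score because correctly rejected negatives never enter the formula, whereas accuracy on the second input would rise to 0.98 from the easy negatives alone.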

A Unified MRC Framework for Named Entity Recognition
Xiaoya Li | Jingrong Feng | Yuxian Meng | Qinghong Han | Fei Wu | Jiwei Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The task of named entity recognition (NER) is normally divided into nested NER and flat NER depending on whether named entities are nested or not. Models are usually separately developed for the two tasks, since sequence labeling models, the most widely used backbone for flat NER, are only able to assign a single label to a particular token, which is unsuitable for nested NER where a token may be assigned several labels. In this paper, we propose a unified framework that is capable of handling both flat and nested NER tasks. Instead of treating the task of NER as a sequence labeling problem, we propose to formulate it as a machine reading comprehension (MRC) task. For example, extracting entities with the PER label is formalized as extracting answer spans to the question "which person is mentioned in the text". This formulation naturally tackles the entity overlapping issue in nested NER: the extraction of two overlapping entities with different categories requires answering two independent questions. Additionally, since the query encodes informative prior knowledge, this strategy facilitates the process of entity extraction, leading to better performances for not only nested NER, but also flat NER. We conduct experiments on both nested and flat NER datasets. Experimental results demonstrate the effectiveness of the proposed formulation. We are able to achieve a large performance boost over current SOTA models on nested NER datasets, i.e., +1.28, +2.55, +5.44, +6.37, respectively on ACE04, ACE05, GENIA and KBP17, along with SOTA results on flat NER datasets, i.e., +0.24, +1.95, +0.21, +1.49 respectively on English CoNLL 2003, English OntoNotes 5.0, Chinese MSRA and Chinese OntoNotes 4.0.
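
The query-per-label formulation can be sketched as a data-preparation step. This is a hypothetical illustration rather than the authors' pipeline: only the PER question comes from the abstract, while the ORG and GPE questions, the function name `build_mrc_examples`, and the toy sentence are assumptions, and the span-extraction model itself is omitted.

```python
# One natural-language question per entity label. Only the PER question
# is taken from the abstract; the others are hypothetical.
LABEL_QUERIES = {
    "PER": "which person is mentioned in the text",
    "ORG": "which organization is mentioned in the text",
    "GPE": "which geopolitical entity is mentioned in the text",
}

def build_mrc_examples(tokens, gold_spans):
    """Turn one NER-annotated sentence into one MRC example per label.

    tokens: list of words.
    gold_spans: list of (start, end, label), inclusive token indices;
                spans with different labels may overlap (nested NER).
    """
    examples = []
    for label, question in LABEL_QUERIES.items():
        answers = [(s, e) for s, e, span_label in gold_spans if span_label == label]
        examples.append(
            {"label": label, "question": question, "context": tokens, "answers": answers}
        )
    return examples

tokens = ["Bank", "of", "China", "opened", "today"]
# "Bank of China" (ORG) contains the nested entity "China" (GPE).
gold_spans = [(0, 2, "ORG"), (2, 2, "GPE")]
mrc_examples = build_mrc_examples(tokens, gold_spans)
answers_by_label = {ex["label"]: ex["answers"] for ex in mrc_examples}
print(answers_by_label)
```

Because the two overlapping entities are answers to different questions, no single token has to carry more than one label, which is the property the abstract attributes to the MRC formulation.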

CorefQA: Coreference Resolution as Query-based Span Prediction
Wei Wu | Fei Wang | Arianna Yuan | Fei Wu | Jiwei Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we present CorefQA, an accurate and extensible approach for the coreference resolution task. We formulate the problem as a span prediction task, like in question answering: A query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. This formulation comes with the following key advantages: (1) The span prediction strategy provides the flexibility of retrieving mentions left out at the mention proposal stage; (2) In the question answering framework, encoding the mention and its context explicitly in a query makes it possible to have a deep and thorough examination of cues embedded in the context of coreferent mentions; and (3) A plethora of existing question answering datasets can be used for data augmentation to improve the model’s generalization capability. Experiments demonstrate significant performance boost over previous models, with 83.1 (+3.5) F1 score on the CoNLL-2012 benchmark and 87.5 (+2.5) F1 score on the GAP benchmark.

De-Biased Court’s View Generation with Causality
Yiquan Wu | Kun Kuang | Yating Zhang | Xiaozhong Liu | Changlong Sun | Jun Xiao | Yueting Zhuang | Luo Si | Fei Wu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Court’s view generation is a novel but essential task for legal AI, aiming at improving the interpretability of judgment prediction results and enabling automatic legal document generation. While prior text-to-text natural language generation (NLG) approaches can be used to address this problem, neglecting the confounding bias from the data generation mechanism can limit the model performance, and the bias may pollute the learning outcomes. In this paper, we propose a novel Attentional and Counterfactual based Natural Language Generation (AC-NLG) method, consisting of an attentional encoder and a pair of innovative counterfactual decoders. The attentional encoder leverages the plaintiff’s claim and fact description as input to learn a claim-aware encoder from which the claim-related information in the fact description can be emphasized. The counterfactual decoders are employed to eliminate the confounding bias in data and generate judgment-discriminative court’s views (both supportive and non-supportive views) by incorporating a synergistic judgment predictive model. Comprehensive experiments show the effectiveness of our method under both quantitative and qualitative evaluation metrics.


Learning Dynamic Context Augmentation for Global Entity Linking
Xiyuan Yang | Xiaotao Gu | Sheng Lin | Siliang Tang | Yueting Zhuang | Fei Wu | Zhigang Chen | Guoping Hu | Xiang Ren
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite the recent success of collective entity linking (EL) methods, these “global” inference methods may yield sub-optimal results when the “all-mention coherence” assumption breaks, and often suffer from high computational cost at the inference stage, due to the complex search space. In this paper, we propose a simple yet effective solution, called Dynamic Context Augmentation (DCA), for collective EL, which requires only one pass through the mentions in a document. DCA sequentially accumulates context information to make efficient, collective inference, and can cope with different local EL models as a plug-and-enhance module. We explore both supervised and reinforcement learning strategies for learning the DCA model. Extensive experiments show the effectiveness of our model with different learning settings, base models, decision orders and attention mechanisms.

Posterior-regularized REINFORCE for Instance Selection in Distant Supervision
Qi Zhang | Siliang Tang | Xiang Ren | Fei Wu | Shiliang Pu | Yueting Zhuang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper provides a new way to improve the efficiency of the REINFORCE training process. We apply it to the task of instance selection in distant supervision. Modeling the instance selection in one bag as a sequential decision process, a reinforcement learning agent is trained to determine whether an instance is valuable or not and construct a new bag with less noisy instances. However, unbiased methods such as REINFORCE usually take a long time to train. This paper adopts posterior regularization (PR) to integrate some domain-specific rules in instance selection using REINFORCE. As the experimental results show, this method remarkably improves the performance of the relation classifier trained on a cleaned distant supervision dataset as well as the efficiency of the REINFORCE training.

KCAT: A Knowledge-Constraint Typing Annotation Tool
Sheng Lin | Luye Zheng | Bo Chen | Siliang Tang | Zhigang Chen | Guoping Hu | Yueting Zhuang | Fei Wu | Xiang Ren
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

In this paper, we propose an efficient Knowledge Constraint Fine-grained Entity Typing Annotation Tool, which further improves the entity typing process through entity linking together with some practical functions.


NITE: A Neural Inductive Teaching Framework for Domain Specific NER
Siliang Tang | Ning Zhang | Jinjiang Zhang | Fei Wu | Yueting Zhuang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In domain-specific NER, deep models usually perform poorly because labeled training data is insufficient. In this paper, we propose a novel Neural Inductive TEaching framework (NITE) to transfer knowledge from existing domain-specific NER models into an arbitrary deep neural network in a teacher-student training manner. NITE is a general framework that builds upon transfer learning and multiple instance learning, which collaboratively not only transfers knowledge to a deep student network but also reduces the noise from teachers. NITE can help deep learning methods to effectively utilize existing resources (i.e., models, labeled and unlabeled data) in a small domain. Experimental results on Disease NER show that, without using any labeled data, NITE can boost the F1-score of a CNN-bidirectional LSTM-CRF NER neural network by about 30%.


Open Information Extraction Using Wikipedia
Fei Wu | Daniel S. Weld
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Machine Reading at the University of Washington
Hoifung Poon | Janara Christensen | Pedro Domingos | Oren Etzioni | Raphael Hoffmann | Chloe Kiddon | Thomas Lin | Xiao Ling | Mausam | Alan Ritter | Stefan Schoenmackers | Stephen Soderland | Dan Weld | Fei Wu | Congle Zhang
Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading


Portuguese-Chinese machine translation in Macao
Yi-Ping Li | Chi-Man Pun | Fei Wu
Proceedings of Machine Translation Summit VII

There have been substantial changes in computing practices in cyberspace, mainly as a result of the proliferation of low-priced, under-utilized, powerful heterogeneous computers connected by high-speed links. In this paper we review the evolution of computing platforms and introduce our Portuguese-Chinese corpus-based machine translation (CBMT) system, which employs a statistical approach with automatic bilingual alignment support. Our improved algorithm for aligning bilingual parallel texts can achieve 97% accuracy. At the same time, we introduce the "distributed translation computing" concept to construct a uniform distributed shared-object workstation for retrieving technical terms, and to achieve well-balanced, high computing performance across a network in which heterogeneous computers reside and are intermittently under-utilized. With it, we can speed up the retrieval of technical terms from noisy bilingual web text and build up the Portuguese-Chinese corpus base.