Fei Cheng


2021

pdf bib
OCHADAI-KYOTO at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction
Yuki Taya | Lis Kanashiro Pereira | Fei Cheng | Ichiro Kobayashi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

We propose an ensemble model for predicting the lexical complexity of words and multiword expressions (MWEs). The model receives as input a sentence with a target word or MWE and outputs its complexity score. Given that a key challenge with this task is the limited size of annotated data, our model relies on pretrained contextual representations from different state-of-the-art transformer-based language models (i.e., BERT and RoBERTa), and on a variety of training methods for further enhancing model generalization and robustness: multi-step fine-tuning and multi-task learning, and adversarial training. Additionally, we propose to enrich contextual representations by adding hand-crafted features during training. Our model achieved competitive results and ranked among the top-10 systems in both sub-tasks.

2020

pdf bib
Pre-training via Leveraging Assisting Languages for Neural Machine Translation
Haiyue Song | Raj Dabre | Zhuoyuan Mao | Fei Cheng | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks. However, large monolingual corpora might not always be available for the languages of interest (LOI). Thus, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. We utilize script mapping (Chinese to Japanese) to increase the similarity (number of cognates) between the monolingual corpora of helping languages and LOI. An empirical case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. Using only Chinese and French monolingual corpora, we were able to improve Japanese-English translation quality by up to 8.5 BLEU in low-resource scenarios.

pdf bib
A System for Worldwide COVID-19 Information Aggregation
Akiko Aizawa | Frederic Bergeron | Junjie Chen | Fei Cheng | Katsuhiko Hayashi | Kentaro Inui | Hiroyoshi Ito | Daisuke Kawahara | Masaru Kitsuregawa | Hirokazu Kiyomaru | Masaki Kobayashi | Takashi Kodama | Sadao Kurohashi | Qianying Liu | Masaki Matsubara | Yusuke Miyao | Atsuyuki Morishima | Yugo Murawaki | Kazumasa Omura | Haiyue Song | Eiichiro Sumita | Shinji Suzuki | Ribeka Tanaka | Yu Tanaka | Masashi Toyoda | Nobuhiro Ueda | Honai Ueoka | Masao Utiyama | Ying Zhong
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

pdf bib
Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.

pdf bib
Minimize Exposure Bias of Seq2Seq Models in Joint Entity and Relation Extraction
Ranran Haoran Zhang | Qianying Liu | Aysa Xuemo Fan | Heng Ji | Daojian Zeng | Fei Cheng | Daisuke Kawahara | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These methods introduce exposure bias, which may cause the models overfit to the frequent label combination, thus limiting the generalization ability. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at https://github.com/WindChimeRan/OpenJERE.

pdf bib
Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Fei Cheng | Masayuki Asahara | Ichiro Kobayashi | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Temporal relation classification is the pair-wise task for identifying the relation of a temporal link (TLINKs) between two mentions, i.e. event, time and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) hinder from using the whole data. This paper presents an event centric model that allows to manage dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.

pdf bib
Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases
Shuntaro Yada | Ayami Joh | Ribeka Tanaka | Fei Cheng | Eiji Aramaki | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

Applying natural language processing (NLP) to medical and clinical texts can bring important social benefits by mining valuable information from unstructured text. A popular application for that purpose is named entity recognition (NER), but the annotation policies of existing clinical corpora have not been standardized across clinical texts of different types. This paper presents an annotation guideline aimed at covering medical documents of various types such as radiography interpretation reports and medical records. Furthermore, the annotation was designed to avoid burdensome requirements related to medical knowledge, thereby enabling corpus development without medical specialists. To achieve these design features, we specifically focus on critical lung diseases to stabilize linguistic patterns in corpora. After annotating around 1100 electronic medical records following the annotation scheme, we demonstrated its feasibility using an NER task. Results suggest that our guideline is applicable to large-scale clinical NLP projects.

2018

pdf bib
Automatic Error Correction on Japanese Functional Expressions Using Character-based Neural Machine Translation
Jun Liu | Fei Cheng | Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Inducing Temporal Relations from Time Anchor Annotation
Fei Cheng | Yusuke Miyao
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recognizing temporal relations among events and time expressions has been an essential but challenging task in natural language processing. Conventional annotation of judging temporal relations puts a heavy load on annotators. In reality, the existing annotated corpora include annotations on only “salient” event pairs, or on pairs in a fixed window of sentences. In this paper, we propose a new approach to obtain temporal relations from absolute time value (a.k.a. time anchors), which is suitable for texts containing rich temporal information such as news articles. We start from time anchors for events and time expressions, and temporal relation annotations are induced automatically by computing relative order of two time anchors. This proposal shows several advantages over the current methods for temporal relation annotation: it requires less annotation effort, can induce inter-sentence relations easily, and increases informativeness of temporal relations. We compare the empirical statistics and automatic recognition results with our data against a previous temporal relation corpus. We also reveal that our data contributes to a significant improvement of the downstream time anchor prediction task, demonstrating 14.1 point increase in overall accuracy.

2017

pdf bib
Classifying Temporal Relations by Bidirectional LSTM over Dependency Paths
Fei Cheng | Yusuke Miyao
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Temporal relation classification is becoming an active research field. Lots of methods have been proposed, while most of them focus on extracting features from external resources. Less attention has been paid to a significant advance in a closely related task: relation extraction. In this work, we borrow a state-of-the-art method in relation extraction by adopting bidirectional long short-term memory (Bi-LSTM) along dependency paths (DP). We make a “common root” assumption to extend DP representations of cross-sentence links. In the final comparison to two state-of-the-art systems on TimeBank-Dense, our model achieves comparable performance, without using external knowledge, as well as manually annotated attributes of entities (class, tense, polarity, etc.).

2015

pdf bib
Synthetic Word Parsing Improves Chinese Word Segmentation
Fei Cheng | Kevin Duh | Yuji Matsumoto
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

pdf bib
Parsing Chinese Synthetic Words with a Character-based Dependency Model
Fei Cheng | Kevin Duh | Yuji Matsumoto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Synthetic word analysis is a potentially important but relatively unexplored problem in Chinese natural language processing. Two issues with the conventional pipeline methods involving word segmentation are (1) the lack of a common segmentation standard and (2) the poor segmentation performance on OOV words. These issues may be circumvented if we adopt the view of character-based parsing, providing both internal structures to synthetic words and global structure to sentences in a seamless fashion. However, the accuracy of synthetic word parsing is not yet satisfactory, due to the lack of research. In view of this, we propose and present experiments on several synthetic word parsers. Additionally, we demonstrate the usefulness of incorporating large unlabelled corpora and a dictionary for this task. Our parsers significantly outperform the baseline (a pipeline method).