Kazumasa Omura

2024

pdf bib abs
An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition
Kazumasa Omura | Fei Cheng | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Implicit Discourse Relation Recognition (IDRR), which is the task of recognizing the semantic relation between given text spans that do not contain overt clues, is a long-standing and challenging problem. In particular, the paucity of training data for some error-prone discourse relations makes the problem even more challenging. To address this issue, we propose a method of generating synthetic data for IDRR using a large language model. The proposed method is summarized as two folds: extraction of confusing discourse relation pairs based on false negative rate and synthesis of data focused on the confusion. The key points of our proposed method are utilizing a confusion matrix and adopting two-stage prompting to obtain effective synthetic data. According to the proposed method, we generated synthetic data several times larger than training examples for some error-prone discourse relations and incorporated it into training. As a result of experiments, we achieved state-of-the-art macro-F1 performance thanks to the synthetic data without sacrificing micro-F1 performance and demonstrated its positive effects especially on recognizing some infrequent discourse relations.

2023

pdf bib abs
KWJA: A Unified Japanese Analyzer Based on Foundation Models
Nobuhiro Ueda | Kazumasa Omura | Takashi Kodama | Hirokazu Kiyomaru | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We present KWJA, a high-performance unified Japanese text analyzer based on foundation models.KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers.KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task.KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.

2022

pdf bib abs
Improving Commonsense Contingent Reasoning by Pseudo-data and Its Application to the Related Tasks
Kazumasa Omura | Sadao Kurohashi
Proceedings of the 29th International Conference on Computational Linguistics

Contingent reasoning is one of the essential abilities in natural language understanding, and many language resources annotated with contingent relations have been constructed. However, despite the recent advances in deep learning, the task of contingent reasoning is still difficult for computers. In this study, we focus on the reasoning of contingent relation between basic events. Based on the existing data construction method, we automatically generate large-scale pseudo-problems and incorporate the generated data into training. We also investigate the generality of contingent knowledge through quantitative evaluation by performing transfer learning on the related tasks: discourse relation analysis, the Japanese Winograd Schema Challenge, and the JCommonsenseQA. The experimental results show the effectiveness of utilizing pseudo-problems for both the commonsense contingent reasoning task and the related tasks, which suggests the importance of contingent reasoning.

2020

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

pdf bib abs
A Method for Building a Commonsense Inference Dataset based on Basic Events
Kazumasa Omura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present a scalable, low-bias, and low-cost method for building a commonsense inference dataset that combines automatic extraction from a corpus and crowdsourcing. Each problem is a multiple-choice question that asks contingency between basic events. We applied the proposed method to a Japanese corpus and acquired 104k problems. While humans can solve the resulting problems with high accuracy (88.9%), the accuracy of a high-performance transfer learning model is reasonably low (76.0%). We also confirmed through dataset analysis that the resulting dataset contains low bias. We released the datatset to facilitate language understanding research.

2019

pdf bib abs
Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

Co-authors

Venues

coling2
nlpcovid191
lrec1
ws1
emnlp1
show all...

acl1