Tsutomu Hirao

2025

Automatic Evaluation of Language Generation Technology Based on Structure Alignment
Katsuki Chousa | Tsutomu Hirao
Proceedings of the 31st International Conference on Computational Linguistics

Language generation techniques require automatic evaluation to carry out efficient and reproducible experiments. While n-gram matching is standard, it fails to capture semantic equivalence with different wording. Recent methods have addressed this issue by using contextual embeddings from pre-trained language models to compute the similarity between reference and hypothesis. However, these methods frequently disregard the syntax of sentences, despite its crucial role in determining meaning, and thus assign unjustifiably high scores. This paper proposes an automatic evaluation metric that considers both the words in sentences and their syntactic structures. We integrate syntactic information into the recent embedding-based approach. Experimental results obtained from two NLP tasks show that our method is at least comparable to standard baselines.

2024

pdf bib abs

Can we obtain significant success in RST discourse parsing by using Large Language Models?
Aru Maekawa | Tsutomu Hirao | Hidetaka Kamigaito | Manabu Okumura
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, decoder-only pre-trained large language models (LLMs), with several tens of billion parameters, have significantly impacted a wide range of natural language processing (NLP) tasks. While encoder-only or encoder-decoder pre-trained language models have already proved to be effective in discourse parsing, the extent to which LLMs can perform this task remains an open research question. Therefore, this paper explores how beneficial such LLMs are for Rhetorical Structure Theory (RST) discourse parsing. Here, the parsing process for both fundamental top-down and bottom-up strategies is converted into prompts, which LLMs can work with. We employ Llama 2 and fine-tune it with QLoRA, which has fewer parameters that can be tuned. Experimental results on three benchmark datasets, RST-DT, Instr-DT, and the GUM corpus, demonstrate that Llama 2 with 70 billion parameters in the bottom-up strategy obtained state-of-the-art (SOTA) results with significant differences. Furthermore, our parsers demonstrated generalizability when evaluated on RST-DT, showing that, in spite of being trained with the GUM corpus, it obtained similar performances to those of existing parsers trained with RST-DT.

pdf bib abs

WikiSplit++: Easy Data Refinement for Split and Rephrase
Hayato Tsukagoshi | Tsutomu Hirao | Makoto Morishita | Katsuki Chousa | Ryohei Sasano | Koichi Takeda
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.

pdf bib abs

Argument Mining as a Text-to-Text Generation Task
Masayuki Kawarada | Tsutomu Hirao | Wataru Uchida | Masaaki Nagata
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures.Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

pdf bib abs

Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs
Masashi Oshika | Makoto Morishita | Tsutomu Hirao | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: ACL 2024

In recent years, neural machine translation (NMT) has become widely used in everyday life. However, the current NMT lacks a mechanism to adjust the difficulty level of translations to match the user’s language level. Additionally, due to the bias in the training data for NMT, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces high Age of Acquisitions (AoA) words in translations with simpler words to match the translations to the user’s level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.

pdf bib abs

Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches
Tsutomu Hirao | Naoki Kobayashi | Hidetaka Kamigaito | Manabu Okumura | Akisato Kimura
Findings of the Association for Computational Linguistics: EMNLP 2024

This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.

2023

pdf bib abs

Implicit Sense-labeled Connective Recognition as Text Generation
Yui Oka | Tsutomu Hirao
Findings of the Association for Computational Linguistics: EMNLP 2023

Implicit Discourse Relation Recognition (IDRR) involves identifying the sense label of an implicit connective between adjacent text spans. This has traditionally been approached as a classification task. However, some downstream tasks require more than just a sense label as well as the specific connective used. This paper presents Implicit Sense-labeled Connective Recognition (ISCR), which identifies the implicit connectives and their sense labels between adjacent text spans. ISCR can be treated as a classification task, but a large number of potential categories, sense labels, and uneven distribution of instances among them make this difficult. Instead, this paper handles the task as a text-generation task, using an encoder-decoder model to generate both connectives and their sense labels. Here, we explore a classification method and three kinds of text-generation methods. From our evaluation results on PDTB-3.0, we found that our method outperforms the conventional classification-based method.

2022

pdf bib abs

A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing
Naoki Kobayashi | Tsutomu Hirao | Hidetaka Kamigaito | Manabu Okumura | Masaaki Nagata
Findings of the Association for Computational Linguistics: EMNLP 2022

To promote and further develop RST-style discourse parsing models, we need a strong baseline that can be regarded as a reference for reporting reliable experimental results. This paper explores a strong baseline by integrating existing simple parsing strategies, top-down and bottom-up, with various transformer-based pre-trained language models.The experimental results obtained from two benchmark datasets demonstrate that the parsing performance strongly relies on the pre-trained language models rather than the parsing strategies.In particular, the bottom-up parser achieves large performance gains compared to the current best parser when employing DeBERTa.We further reveal that language models with a span-masking scheme especially boost the parsing performance through our analysis within intra- and multi-sentential parsing, and nuclearity prediction.

2021

pdf bib abs

Improving Neural RST Parsing Model with Silver Agreement Subtrees
Naoki Kobayashi | Tsutomu Hirao | Hidetaka Kamigaito | Manabu Okumura | Masaaki Nagata
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Most of the previous Rhetorical Structure Theory (RST) parsing methods are based on supervised learning such as neural networks, that require an annotated corpus of sufficient size and quality. However, the RST Discourse Treebank (RST-DT), the benchmark corpus for RST parsing in English, is small due to the costly annotation of RST trees. The lack of large annotated training data causes poor performance especially in relation labeling. Therefore, we propose a method for improving neural RST parsing models by exploiting silver data, i.e., automatically annotated data. We create large-scale silver data from an unlabeled corpus by using a state-of-the-art RST parser. To obtain high-quality silver data, we extract agreement subtrees from RST trees for documents built using the RST parsers. We then pre-train a neural RST parser with the obtained silver data and fine-tune it on the RST-DT. Experimental results show that our method achieved the best micro-F1 scores for Nuclearity and Relation at 75.0 and 63.2, respectively. Furthermore, we obtained a remarkable gain in the Relation score, 3.0 points, against the previous state-of-the-art parser.

2020

pdf bib abs

Sequential Span Classification with Neural Semi-Markov CRFs for Biomedical Abstracts
Kosuke Yamada | Tsutomu Hirao | Ryohei Sasano | Koichi Takeda | Masaaki Nagata
Findings of the Association for Computational Linguistics: EMNLP 2020

Dividing biomedical abstracts into several segments with rhetorical roles is essential for supporting researchers’ information access in the biomedical domain. Conventional methods have regarded the task as a sequence labeling task based on sequential sentence classification, i.e., they assign a rhetorical label to each sentence by considering the context in the abstract. However, these methods have a critical problem: they are prone to mislabel longer continuous sentences with the same rhetorical label. To tackle the problem, we propose sequential span classification that assigns a rhetorical label, not to a single sentence but to a span that consists of continuous sentences. Accordingly, we introduce Neural Semi-Markov Conditional Random Fields to assign the labels to such spans by considering all possible spans of various lengths. Experimental results obtained from PubMed 20k RCT and NICTA-PIBOSO datasets demonstrate that our proposed method achieved the best micro sentence-F1 score as well as the best micro span-F1 score.

2019

pdf bib abs

Generating Natural Anagrams: Towards Language Generation Under Hard Combinatorial Constraints
Masaaki Nishino | Sho Takase | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

An anagram is a sentence or a phrase that is made by permutating the characters of an input sentence or a phrase. For example, “Trims cash” is an anagram of “Christmas”. Existing automatic anagram generation methods can find possible combinations of words form an anagram. However, they do not pay much attention to the naturalness of the generated anagrams. In this paper, we show that simple depth-first search can yield natural anagrams when it is combined with modern neural language models. Human evaluation results show that the proposed method can generate significantly more natural anagrams than baseline methods.

pdf bib abs

Split or Merge: Which is Better for Unsupervised RST Parsing?
Naoki Kobayashi | Tsutomu Hirao | Kengo Nakamura | Hidetaka Kamigaito | Manabu Okumura | Masaaki Nagata
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Rhetorical Structure Theory (RST) parsing is crucial for many downstream NLP tasks that require a discourse structure for a text. Most of the previous RST parsers have been based on supervised learning approaches. That is, they require an annotated corpus of sufficient size and quality, and heavily rely on the language and domain dependent corpus. In this paper, we present two language-independent unsupervised RST parsing methods based on dynamic programming. The first one builds the optimal tree in terms of a dissimilarity score function that is defined for splitting a text span into smaller ones. The second builds the optimal tree in terms of a similarity score function that is defined for merging two adjacent spans into a large one. Experimental results on English and German RST treebanks showed that our parser based on span merging achieved the best score, around 0.8 F₁ score, which is close to the scores of the previous supervised parsers.

pdf bib abs

NTT’s Machine Translation Systems for WMT19 Robustness Task
Soichiro Murakami | Makoto Morishita | Tsutomu Hirao | Masaaki Nagata
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes NTT’s submission to the WMT19 robustness task. This task mainly focuses on translating noisy text (e.g., posts on Twitter), which presents different difficulties from typical translation tasks such as news. Our submission combined techniques including utilization of a synthetic corpus, domain adaptation, and a placeholder mechanism, which significantly improved over the previous baseline. Experimental results revealed the placeholder mechanism, which temporarily replaces the non-standard tokens including emojis and emoticons with special placeholder tokens during translation, improves translation accuracy even with noisy texts.

2018

pdf bib abs

Automatic Pyramid Evaluation Exploiting EDU-based Extractive Reference Summaries
Tsutomu Hirao | Hidetaka Kamigaito | Masaaki Nagata
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper tackles automation of the pyramid method, a reliable manual evaluation framework. To construct a pyramid, we transform human-made reference summaries into extractive reference summaries that consist of Elementary Discourse Units (EDUs) obtained from source documents and then weight every EDU by counting the number of extractive reference summaries that contain the EDU. A summary is scored by the correspondences between EDUs in the summary and those in the pyramid. Experiments on DUC and TAC data sets show that our methods strongly correlate with various manual evaluations.

pdf bib abs

Provable Fast Greedy Compressive Summarization with Any Monotone Submodular Function
Shinsaku Sakaue | Tsutomu Hirao | Masaaki Nishino | Masaaki Nagata
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Submodular maximization with the greedy algorithm has been studied as an effective approach to extractive summarization. This approach is known to have three advantages: its applicability to many useful submodular objective functions, the efficiency of the greedy algorithm, and the provable performance guarantee. However, when it comes to compressive summarization, we are currently missing a counterpart of the extractive method based on submodularity. In this paper, we propose a fast greedy method for compressive summarization. Our method is applicable to any monotone submodular objective function, including many functions well-suited for document summarization. We provide an approximation guarantee of our greedy algorithm. Experiments show that our method is about 100 to 400 times faster than an existing method based on integer-linear-programming (ILP) formulations and that our method empirically achieves more than 95%-approximation.

pdf bib abs

Higher-Order Syntactic Attention Network for Longer Sentence Compression
Hidetaka Kamigaito | Katsuhiko Hayashi | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

A sentence compression method using LSTM can generate fluent compressed sentences. However, the performance of this method is significantly degraded when compressing longer sentences since it does not explicitly handle syntactic features. To solve this problem, we propose a higher-order syntactic attention network (HiSAN) that can handle higher-order dependency features as an attention distribution on LSTM hidden states. Furthermore, to avoid the influence of incorrect parse results, we trained HiSAN by maximizing jointly the probability of a correct output with the attention distribution. Experimental results on Google sentence compression dataset showed that our method achieved the best performance on F1 as well as ROUGE-1,2 and L scores, 83.2, 82.9, 75.8 and 82.7, respectively. In human evaluation, our methods also outperformed baseline methods in both readability and informativeness.

pdf bib abs

Pruning Basic Elements for Better Automatic Evaluation of Summaries
Ukyo Honda | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a simple but highly effective automatic evaluation measure of summarization, pruned Basic Elements (pBE). Although the BE concept is widely used for the automated evaluation of summaries, its weakness is that it redundantly matches basic elements. To avoid this redundancy, pBE prunes basic elements by (1) disregarding frequency count of basic elements and (2) reducing semantically overlapped basic elements based on word similarity. Even though it is simple, pBE outperforms ROUGE in DUC datasets in most cases and achieves the highest rank correlation coefficient in TAC 2011 AESOP task.

2017

pdf bib abs

Oracle Summaries of Compressive Summarization
Tsutomu Hirao | Masaaki Nishino | Masaaki Nagata
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper derives an Integer Linear Programming (ILP) formulation to obtain an oracle summary of the compressive summarization paradigm in terms of ROUGE. The oracle summary is essential to reveal the upper bound performance of the paradigm. Experimental results on the DUC dataset showed that ROUGE scores of compressive oracles are significantly higher than those of extractive oracles and state-of-the-art summarization systems. These results reveal that compressive summarization is a promising paradigm and encourage us to continue with the research to produce informative summaries.

pdf bib abs

Supervised Attention for Sequence-to-Sequence Constituency Parsing
Hidetaka Kamigaito | Katsuhiko Hayashi | Tsutomu Hirao | Hiroya Takamura | Manabu Okumura | Masaaki Nagata
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The sequence-to-sequence (Seq2Seq) model has been successfully applied to machine translation (MT). Recently, MT performances were improved by incorporating supervised attention into the model. In this paper, we introduce supervised attention to constituency parsing that can be regarded as another translation task. Evaluation results on the PTB corpus showed that the bracketing F-measure was improved by supervised attention.

pdf bib abs

Enumeration of Extractive Oracle Summaries
Tsutomu Hirao | Masaaki Nishino | Jun Suzuki | Masaaki Nagata
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

Tsutomu Hirao

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2010

2009

2005

2004

2003

2002

Co-authors

Venues