Tatsuya Ishigaki

2025

VitaEval: Open-source Human Evaluation Tool for Video-to-Text and Video-to-Audio Systems
Goran Topić | Yuki Saito | Katsuhito Sudoh | Shinnosuke Takamichi | Hiroya Takamura | Graham Neubig | Tatsuya Ishigaki
Proceedings of the 18th International Natural Language Generation Conference: System Demonstrations

Live Football Commentary (LFC): A Large‐Scale Dataset for Building Football Commentary Generation Models
Taiga Someya | Tatsuya Ishigaki | Hiroya Takamura
Proceedings of the 18th International Natural Language Generation Conference

Live football commentary brings the atmosphere and excitement of matches to fans in real time, but producing it requires costly professional announcers. We address this challenge by formulating commentary generation from player- and ball-tracking coordinates as a new language generation task. To facilitate research on this problem, we compile the Live Football Commentary (LFC) dataset: 12,440 time-stamped Japanese utterances aligned with tracking data for 40 J1 League matches (about 60 h). We benchmark three LLM-based baselines that receive the tracking data (i) as plain text, (ii) as pitch-map images, or (iii) in both modalities. Human evaluation shows that the text encoding already outperforms the image and multimodal variants in both accuracy and relevance, indicating that current LLMs exploit structured coordinates more effectively than raw visuals. We release the LFC transcripts and evaluation code to establish a public test bed and spur future work on tracking-based commentary generation, saliency detection, and cross-modal integration.

Evaluating LLMs’ Ability to Understand Numerical Time Series for Text Generation
Mizuki Arai | Tatsuya Ishigaki | Masayuki Kawarada | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi
Proceedings of the 18th International Natural Language Generation Conference

Data-to-text generation tasks often take numerical time series as input, such as financial statistics or meteorological data. Although large language models (LLMs) are a powerful approach to data-to-text generation, we still lack a comprehensive understanding of how well they actually understand time-series data. We therefore introduce a benchmark with 18 evaluation tasks to assess LLMs’ ability to interpret numerical time series. The tasks fall into four categories: 1) event detection (identifying maxima and minima); 2) computation (averaging and summation); 3) pairwise comparison (comparing values over time); and 4) inference (imputation and forecasting). Our experiments reveal five key findings: 1) even state-of-the-art LLMs struggle with complex multi-step reasoning; 2) tasks that require extracting values or performing computations within a specified range of the time series significantly reduce accuracy; 3) instruction tuning offers inconsistent improvements for numerical interpretation; 4) reasoning-based models outperform standard LLMs in complex numerical tasks; and 5) LLMs perform interpolation better than forecasting. These results establish a clear baseline and serve as a wake-up call for anyone aiming to blend fluent language with trustworthy numeric precision in time-series scenarios.
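
To make the four task categories concrete, the following minimal Python sketch computes reference answers for one toy series. The function names, the toy values, and the midpoint rule used for imputation are illustrative assumptions; the benchmark's actual task definitions, prompts, and scoring are not specified in this abstract.

```python
# Toy series; the values are made up for illustration.
series = [12.0, 15.0, 11.0, 18.0, 14.0]


def event_detection(xs):
    """Category 1: locate the maximum and minimum (first occurrence)."""
    return {"argmax": xs.index(max(xs)), "argmin": xs.index(min(xs))}


def range_computation(xs, start, end):
    """Category 2: sum and average over the inclusive index range [start, end]."""
    window = xs[start:end + 1]
    return {"sum": sum(window), "mean": sum(window) / len(window)}


def pairwise_comparison(xs, i, j):
    """Category 3: compare the values at two time steps."""
    if xs[i] > xs[j]:
        return "greater"
    if xs[i] < xs[j]:
        return "less"
    return "equal"


def midpoint_imputation(xs, i):
    """Category 4 (imputation): estimate an interior value as the mean of
    its two neighbours. This is one simple assumed rule, not the benchmark's."""
    return (xs[i - 1] + xs[i + 1]) / 2


print(event_detection(series))
print(range_computation(series, 1, 3))
print(pairwise_comparison(series, 0, 4))
print(midpoint_imputation(series, 2))
```

Reference answers of this kind could then be compared against an LLM's free-text response for the same series.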

pdf bib abs

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.

Overview of PBIG Shared Task at AgentScen 2025: Product Business Idea Generation from Patents
Wataru Hirota | Chung-Chi Chen | Tomoko Ohkuma | Tomoki Taniguchi | Tatsuya Ishigaki
Proceedings of the 2nd Workshop on Agent AI for Scenario Planning

Proceedings of the 2nd Workshop on Agent AI for Scenario Planning
Chung-Chi Chen | Tatsuya Ishigaki | Sophia Ananiadou | Hiroya Takamura
Proceedings of the 2nd Workshop on Agent AI for Scenario Planning

pdf bib abs

Previous research on sports commentary generation has primarily focused on describing major events in the match. However, real-world commentary often includes comments beyond what is visible in the video content, e.g., “Florentina has acquired him for 7 million euros.” To enhance the viewing experience with such background information, we developed an audio commentary system for football matches that generates utterances with background information, as well as play-by-play commentary. Our system first extracts visual information and determines whether it is an appropriate time to produce an utterance. Then it decides which type of utterance to generate: play-by-play or background information. In the latter case, the system leverages external knowledge through retrieval-augmented generation.

Exploring the Design of Multi-Agent LLM Dialogues for Research Ideation
Keisuke Ueda | Wataru Hirota | Kosuke Takahashi | Takahiro Omi | Kosuke Arima | Tatsuya Ishigaki
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation–critique–revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation.

2024

Pretraining and Updates of Domain-Specific LLM: A Case Study in the Japanese Business Domain
Kosuke Takahashi | Takahiro Omi | Kosuke Arima | Tatsuya Ishigaki
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
Chung-Chi Chen | Tatsuya Ishigaki | Hiroya Takamura | Akihiko Murai | Suzuko Nishino | Hen-Hsen Huang | Hsin-Hsi Chen
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

Leveraging Plug-and-Play Models for Rhetorical Structure Control in Text Generation
Yuka Yokogawa | Tatsuya Ishigaki | Hiroya Takamura | Yusuke Miyao | Ichiro Kobayashi
Proceedings of the 17th International Natural Language Generation Conference

We propose a method that extends a BART-based language generator using a plug-and-play model to control the rhetorical structure of generated text. Our approach considers rhetorical relations between clauses and generates sentences that reflect this structure using plug-and-play language models. We evaluated our method using the Newsela corpus, which consists of texts at various levels of English proficiency. Our experiments demonstrated that our method outperforms the vanilla BART in terms of the correctness of output discourse and rhetorical structures. In existing methods, the rhetorical structure tends to deteriorate when compared to the baseline, the vanilla BART, as measured by n-gram overlap metrics such as BLEU. However, our proposed method does not exhibit this significant deterioration, demonstrating its advantage.

Prompting for Numerical Sequences: A Case Study on Market Comment Generation
Masayuki Kawarada | Tatsuya Ishigaki | Hiroya Takamura
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) have been applied to a wide range of data-to-text generation tasks, including tables, graphs, and time-series numerical data-to-text settings. While research on generating prompts for structured data such as tables and graphs is gaining momentum, in-depth investigations into prompting for time-series numerical data are lacking. Therefore, this study explores various input representations, including sequences of tokens and structured formats such as HTML, LaTeX, and Python-style codes. In our experiments, we focus on the task of Market Comment Generation, which involves taking a numerical sequence of stock prices as input and generating a corresponding market comment. Contrary to our expectations, the results show that prompts resembling programming languages yield better outcomes, whereas those similar to natural languages and longer formats, such as HTML and LaTeX, are less effective. Our findings offer insights into creating effective prompts for tasks that generate text from numerical sequences.
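
As an illustration of what different input representations of the same numerical sequence can look like, the minimal Python sketch below serializes one toy price series as space-separated tokens, as a Python-style assignment, and as an HTML table row. The helper names, the variable name `prices`, and the exact layouts are assumptions made for this example; the prompt formats actually used in the study are not given in this abstract.

```python
def as_tokens(xs):
    """Plain sequence of tokens separated by spaces."""
    return " ".join(str(x) for x in xs)


def as_python(xs, name="prices"):
    """Python-style code: a list assigned to a (hypothetical) variable name."""
    return f"{name} = {list(xs)!r}"


def as_html(xs):
    """HTML table with a single row, one cell per value."""
    cells = "".join(f"<td>{x}</td>" for x in xs)
    return f"<table><tr>{cells}</tr></table>"


prices = [100, 102, 101]  # toy values, not real market data
for render in (as_tokens, as_python, as_html):
    text = render(prices)
    print(f"{render.__name__} ({len(text)} chars): {text}")
```

Even on three values the HTML form is noticeably longer than the other two, which gives a feel for why markup-heavy formats inflate prompt length.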

Evaluating LlaMA-2’s Adaptation to Social Context in Japanese Emails via Fine-Tuning
Muxuan Liu | Tatsuya Ishigaki | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Demonstration Selection Strategies for Numerical Time Series Data-to-Text
Masayuki Kawarada | Tatsuya Ishigaki | Goran Topić | Hiroya Takamura
Findings of the Association for Computational Linguistics: EMNLP 2024

Demonstration selection, the process of selecting examples used in prompts, plays a critical role in in-context learning. This paper explores demonstration selection methods for data-to-text tasks that involve numerical time series data as inputs. Previously developed demonstration selection methods primarily focus on textual inputs, often relying on embedding similarities of textual tokens to select similar instances from an example bank. However, this approach may not be suitable for numerical time series data. To address this issue, we propose two novel selection methods: (1) sequence similarity-based selection using various similarity measures, and (2) task-specific knowledge-based selection. From our experiments on two benchmark datasets, we found that our proposed models significantly outperform baseline selections and often surpass fine-tuned models. We also found that scale-invariant similarity measures such as Pearson’s correlation work better than scale-variant measures such as Euclidean distance. Manual evaluation by human judges also confirms that our proposed methods outperform conventional methods.
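
The contrast between scale-invariant and scale-variant measures can be seen in a small Python sketch of similarity-based selection from an example bank. The bank contents, the function names, and the top-k ranking rule are illustrative assumptions, not the paper's implementation.

```python
import math


def pearson(a, b):
    """Pearson's correlation: invariant to the scale and offset of each series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_a * var_b)


def euclidean(a, b):
    """Euclidean distance: sensitive to the absolute level of the values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_demonstrations(query, bank, k=1, measure="pearson"):
    """Return the k bank entries most similar to the query series."""
    if measure == "pearson":
        ranked = sorted(bank, key=lambda ex: -pearson(query, ex["series"]))
    else:
        ranked = sorted(bank, key=lambda ex: euclidean(query, ex["series"]))
    return ranked[:k]


# Hypothetical example bank: series paired with reference texts.
bank = [
    {"series": [100, 102, 104, 106], "text": "rising"},
    {"series": [10, 9, 8, 7], "text": "falling"},
]
query = [10.0, 10.2, 10.4, 10.6]  # rising, but at a much lower price level

print(select_demonstrations(query, bank, measure="pearson")[0]["text"])
print(select_demonstrations(query, bank, measure="euclidean")[0]["text"])
```

In this toy case the correlation-based ranking retrieves the example with the same upward shape, while the distance-based ranking retrieves the example that merely sits at a similar price level.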

2023

Pretraining Language- and Domain-Specific BERT on Automatically Translated Text
Tatsuya Ishigaki | Yui Uehara | Goran Topić | Hiroya Takamura
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Domain-specific pretrained language models such as SciBERT are effective for various tasks involving text in specific domains. However, pretraining BERT requires a large-scale language resource, which is not necessarily available in fine-grained domains, especially in non-English languages. In this study, we focus on a setting with no available domain-specific text for pretraining. To this end, we propose a simple framework that trains a BERT on text in the target language automatically translated from a resource-rich language, e.g., English. In this paper, we particularly focus on the materials science domain in Japanese. Our experiments pertain to the task of entity and relation extraction for this domain and language. The experiments demonstrate that the various models pretrained on translated texts consistently perform better than the general BERT in terms of F1 scores although the domain-specific BERTs do not use any human-authored domain-specific text. These results imply that BERTs for various low-resource domains can be successfully trained on texts automatically translated from resource-rich languages.

Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model
Kosuke Takahashi | Takahiro Omi | Kosuke Arima | Tatsuya Ishigaki
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Constructing a Japanese Business Email Corpus Based on Social Situations
Muxuan Liu | Tatsuya Ishigaki | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib abs

Live commentaries are essential for enhancing spectators’ enjoyment and understanding during sports events or e-sports streams. We introduce a live audio commentator system designed specifically for a racing game, driven by the high demand in the e-sports field. While a player is playing a racing game, our system tracks real-time user play data including speed and steer rotations, and generates commentary to accompany the live stream. Human evaluation suggested that generated commentary enhances enjoyment and understanding of races compared to streams without commentary. Incorporating additional modules to improve diversity and detect irregular events, such as course-outs and collisions, further increases the preference for the output commentaries.

2022

Live commentary plays an important role in sports broadcasts and video games, making spectators more excited and immersed. In this context, though approaches for automatically generating such commentary have been proposed in the past, they have been generally concerned with specific fields, where it is possible to leverage domain-specific information. In light of this, we propose the task of generating video commentary in an open-domain fashion. We detail the construction of a new large-scale dataset of transcribed commentary aligned with videos containing various human actions in a variety of domains, and propose approaches based on well-known neural architectures to tackle the task. To understand the strengths and limitations of current approaches, we present an in-depth empirical study based on our data. Our results suggest clear trade-offs between textual and visual inputs for the models and highlight the importance of relying on external knowledge in this open-domain setting, resulting in a set of robust baselines for our task.

pdf bib abs

We introduce document retrieval and comment generation tasks for automating horizon scanning, an important task in the field of futurology that collects sufficient information for predicting drastic societal changes in the mid- or long-term future. The steps are: 1) retrieving news articles that imply drastic changes, and 2) writing subjective comments on each article to make it easier for others to understand. As a first step in automating these tasks, we create a dataset that contains 2,266 manually collected news articles with comments written by experts. We analyze the collected documents and comments with regard to characteristic words, the distance to general articles, and the contents of the comments. Furthermore, we compare several methods for automating horizon scanning. Our experiments show that 1) manually collected articles differ from general articles in the words used and in semantic distance, 2) the contents of the comments can be classified into several categories, and 3) a supervised model trained on our dataset achieves better performance. Our contributions are that we 1) propose document retrieval and comment generation tasks for horizon scanning, 2) create and analyze a new dataset, and 3) report the performance of several models and show that the comment generation task is challenging.

2021

We propose the task of automatically generating commentaries for races in a motor racing game from vision, structured numerical, and textual data. Commentaries provide information to support spectators in understanding events in races. Commentary generation models need to interpret the race situation and generate the correct content at the right moment. We divide the task into two subtasks: utterance timing identification and utterance generation. Because existing datasets do not have such alignments of data in multiple modalities, this setting has not been explored in depth. In this study, we introduce a new large-scale dataset that contains aligned video data, structured numerical data, and transcribed commentaries consisting of 129,226 utterances in 1,389 races in a game. Our analysis reveals that the characteristics of commentaries change over time and with the viewpoint. Our experiments on the subtasks show that it is still challenging for a state-of-the-art vision encoder to capture useful information from videos to generate accurate commentaries. We make the dataset and baseline implementation publicly available for further research.

Unpredictable Attributes in Market Comment Generation
Yumi Hamazono | Tatsuya Ishigaki | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

Existing models for data-to-text tasks generate fluent but sometimes incorrect sentences, e.g., “Nikkei gains” is generated when “Nikkei drops” is expected. To reduce such errors, we investigate models trained on contrastive examples, i.e., incorrect sentences or terms, in addition to correct ones. We first create rules to produce contrastive examples from correct ones by replacing frequent crucial terms such as “gain” or “drop”. We then use learning methods with several losses that exploit contrastive examples. Experiments on the market comment generation task show that 1) exploiting contrastive examples improves the capability of generating sentences with better lexical choice, without degrading fluency, 2) the choice of the loss function is an important factor because performance on different metrics depends on the type of loss function, and 3) the use of examples produced by some specific rules further improves performance. Human evaluation also supports the effectiveness of using contrastive examples.

2019

We propose a data-to-document generator, based on a neural language model, that can easily control the contents of its output texts. A conventional data-to-text model is useful when a reader seeks a global summary of data, because it only has to describe an important part that has been extracted beforehand. However, because what readers are interested in differs from user to user, it is necessary to develop a method that generates various summaries according to users’ interests. We develop a model that generates various summaries and controls their contents by providing the model with explicit reference targets as controllable factors. In the experiments, we used five-minute or one-hour charts of 9 indicators (e.g., Nikkei225) as time-series data, and daily summaries of Nikkei Quick News as textual data. We conducted comparative experiments using two kinds of referential information for generation: human-designed topic labels indicating the contents of a sentence, and automatically extracted keywords.

pdf bib abs

We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. Our tracking module selects and keeps track of salient information and memorizes which records have been mentioned. Our generation module generates a summary conditioned on the state of the tracking module. Our proposed model can be seen as simulating a human-like writing process that gradually selects information by determining intermediate variables while writing the summary. In addition, we explore the effectiveness of writer information for generation. Experimental results show that our proposed model outperforms existing models in all evaluation metrics even without writer information. Incorporating writer information further improves performance, contributing to content planning and surface realization.

Discourse-Aware Hierarchical Attention Network for Extractive Single-Document Summarization
Tatsuya Ishigaki | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Discourse relations between sentences are often represented as a tree, and the tree structure provides important information for summarizers to create a short and coherent summary. However, current neural network-based summarizers treat the source document as just a sequence of sentences and ignore the tree-like discourse structure inherent in the document. To incorporate the information of a discourse tree structure into neural network-based summarizers, we propose a discourse-aware neural extractive summarizer that explicitly takes into account the discourse dependency tree structure of the source document. Our discourse-aware summarizer can jointly learn the discourse structure and the salience score of a sentence by using novel hierarchical attention modules, which can be trained on automatically parsed discourse dependency trees. Experimental results showed that our model achieved competitive or better performance against state-of-the-art models in terms of ROUGE scores on the DailyMail dataset. We further conducted manual evaluations. The results showed that our approach also improved the coherence of the output summaries.

2018

Comments on a stock market often include the reason or cause of changes in stock prices, such as “Nikkei turns lower as yen’s rise hits exporters.” Generating such informative sentences requires capturing the relationship between different resources, including a target stock price. In this paper, we propose a model for automatically generating such informative market comments that refer to external resources. We evaluated our model through an automatic metric in terms of BLEU and human evaluation done by an expert in finance. The results show that our model outperforms the existing model both in BLEU scores and human judgment.

2017

Summarizing Lengthy Questions
Tatsuya Ishigaki | Hiroya Takamura | Manabu Okumura
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this research, we propose the task of question summarization. We first analyzed question-summary pairs extracted from a Community Question Answering (CQA) site, and found that a proportion of questions cannot be summarized by extractive approaches but require abstractive approaches. We created a dataset by regarding the question-title pairs posted on the CQA site as question-summary pairs. Using this data, we trained extractive and abstractive summarization models, and compared them based on ROUGE scores and manual evaluations. Our experimental results show that an abstractive method using an encoder-decoder model with a copying mechanism achieves better scores for both ROUGE-2 F-measure and the evaluations by human judges.
