Takashi Kodama

2026

Scaling Data-Constrained Language Models with Synthetic Data
Hirokazu Kiyomaru | Yusuke Oda | Takashi Kodama | Chaoran Liu | Daisuke Kawahara
Findings of the Association for Computational Linguistics: EACL 2026

Large language models (LLMs) improve with more training data, but practical limits on data collection increasingly constrain further scaling. Advances in instruction-following LLMs have enabled controlled, high-quality text generation, making synthetic data a promising remedy. However, its effectiveness for pre-training non-English LLMs remains underexplored. We study this question for Japanese in a fixed token budget setting in which organic Japanese Web text constitutes only a small share, while far more organic English Web text and instruction-following LLMs capable of generating fluent Japanese are available. We compare three strategies to fill the data shortfall: generating synthetic Japanese text, repeating the limited Japanese Web text, and using English Web text. Experiments show that synthetic Japanese corpora outperform both baselines and approach the performance achieved when the entire token budget is filled with additional organic Japanese Web text.

Comprehensive Study of Bilingual and Multi-category Instruction Pre-training
Takashi Kodama | Yusuke Oda
Findings of the Association for Computational Linguistics: EACL 2026

Instruction pre-training (IPT) has recently emerged as an effective intermediate stage between vanilla pre-training and post-training for large language models (LLMs). However, the optimal design of IPT corpora—such as the balance between raw and instruction-response data, languages, and task categories—remains unclear. We systematically study IPT corpus composition using a bilingual (English and Japanese) and multi-category (coding, general, math, and reasoning) instruction-response dataset. Through extensive IPT experiments across four base models, including both English-centric and bilingual LLMs, we find that: (1) more instruction-response data generally enhances model performance, particularly for models with large VPT budgets; (2) Japanese instruction data can improve English performance through cross-lingual transfer; and (3) the effectiveness of post-training varies across categories: coding performance is largely determined during IPT, while math and reasoning continue to improve during post-training.

2025

In human-human conversation, interpersonal consideration for the interlocutor is essential, and similar expectations are increasingly placed on dialogue systems. This study examines the behavior of dialogue systems in a specific interpersonal scenario where a user vents frustrations and seeks emotional support from a long-time friend represented by a dialogue system. We conducted a human evaluation and qualitative analysis of 15 dialogue systems under this setting. These systems implemented diverse strategies, such as structuring dialogue into distinct phases, modeling interpersonal relationships, and incorporating cognitive behavioral therapy techniques. Our analysis reveals that these approaches contributed to improved perceived empathy, coherence, and appropriateness, highlighting the importance of design choices in socially sensitive dialogue.

Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi | Kouta Nakayama | Takashi Kodama | Saku Sugawara
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study.

Challenges in multimodal task-oriented dialogue between humans and systems, particularly those involving audio and visual interactions, have not been sufficiently explored or shared, forcing researchers to define improvement directions individually without a clearly shared roadmap. To address these challenges, we organized a competition for multimodal task-oriented dialogue systems and constructed a large competition-based dataset of 1,865 minutes of Japanese task-oriented dialogues. This dataset includes audio and visual interactions between diverse systems and human participants. After analyzing system behaviors identified as problematic by the human participants in questionnaire surveys and notable methods employed by the participating teams, we identified key challenges in multimodal task-oriented dialogue systems and discussed potential directions for overcoming these challenges.

2024

Domain Transferable Semantic Frames for Expert Interview Dialogues
Taishi Chika | Taro Okahisa | Takashi Kodama | Yin Jou Huang | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Interviews are an effective method to elicit critical skills to perform particular processes in various domains. In order to understand the knowledge structure of these domain-specific processes, we consider semantic role and predicate annotation based on Frame Semantics. We introduce a dataset of interview dialogues with experts in the culinary and gardening domains, each annotated with semantic frames. This dataset consists of (1) 308 interview dialogues related to the culinary domain, originally assembled by Okahisa et al. (2022), and (2) 100 interview dialogues associated with the gardening domain, which we newly acquired. The labeling specifications take into account the domain-transferability by adopting domain-agnostic labels for frame elements. In addition, we conducted domain transfer experiments from the culinary domain to the gardening domain to examine the domain transferability with our dataset. The experimental results showed the effectiveness of our domain-agnostic labeling scheme.

RecomMind: Movie Recommendation Dialogue with Seeker’s Internal State
Takashi Kodama | Hirokazu Kiyomaru | Yin Jou Huang | Sadao Kurohashi
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)

Humans pay careful attention to the interlocutor’s internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker’s internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis and experiment, we constructed RecomMind, a movie recommendation dialogue dataset with annotations of the seeker’s internal state at the entity level. Each entity has a first-person label annotated by the seeker and a second-person label annotated by the recommender. Our analysis based on RecomMind reveals that the success of recommendations is enhanced when recommenders mention entities that seekers do not know but are interested in. We also propose a response generation framework that explicitly considers the seeker’s internal state, utilizing the chain-of-thought prompting. The human evaluation results show that our proposed method outperforms the baseline method in both consistency and the success of recommendations.

2023

KWJA: A Unified Japanese Analyzer Based on Foundation Models
Nobuhiro Ueda | Kazumasa Omura | Takashi Kodama | Hirokazu Kiyomaru | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We present KWJA, a high-performance unified Japanese text analyzer based on foundation models. KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers. KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task. KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.

Is a Knowledge-based Response Engaging?: An Analysis on Knowledge-Grounded Dialogue with Information Source Annotation
Takashi Kodama | Hirokazu Kiyomaru | Yin Jou Huang | Taro Okahisa | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Currently, most knowledge-grounded dialogue response generation models focus on reflecting given external knowledge. However, even when conveying external knowledge, humans integrate their own knowledge, experiences, and opinions with external knowledge to make their utterances engaging. In this study, we analyze such human behavior by annotating the utterances in an existing knowledge-grounded dialogue corpus. Each entity in the corpus is annotated with its information source, either derived from external knowledge (database-derived) or the speaker’s own knowledge, experiences, and opinions (speaker-derived). Our analysis shows that the presence of speaker-derived information in the utterance improves dialogue engagingness. We also confirm that responses generated by an existing model, which is trained to reflect the given knowledge, cannot include speaker-derived information in responses as often as humans do.

2022

Explicit Use of Topicality in Dialogue Response Generation
Takumi Yoshikoshi | Hayato Atarashi | Takashi Kodama | Sadao Kurohashi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

The current chat dialogue systems implicitly consider the topic given the context, but not explicitly. As a result, these systems often generate inconsistent responses with the topic of the moment. In this study, we propose a dialogue system that responds appropriately following the topic by selecting the entity with the highest “topicality.” In topicality estimation, the model is trained through self-supervised learning that regards entities that appear in both context and response as the topic entities. In response generation, the model is trained to generate topic-relevant responses based on the estimated topicality. Experimental results show that our proposed system can follow the topic more than the existing dialogue system that considers only the context.

Constructing a Culinary Interview Dialogue Corpus with Video Conferencing Tool
Taro Okahisa | Ribeka Tanaka | Takashi Kodama | Yin Jou Huang | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Interview is an efficient way to elicit knowledge from experts of different domains. In this paper, we introduce CIDC, an interview dialogue corpus in the culinary domain in which interviewers play an active role to elicit culinary knowledge from the cooking expert. The corpus consists of 308 interview dialogues (each about 13 minutes in length), which add up to a total of 69,000 utterances. We use a video conferencing tool for data collection, which allows us to obtain the facial expressions of the interlocutors as well as the screen-sharing contents. To understand the impact of the interlocutors’ skill level, we divide the experts into “semi-professionals” and “enthusiasts” and the interviewers into “skilled interviewers” and “unskilled interviewers.” For quantitative analysis, we report the statistics and the results of the post-interview questionnaire. We also conduct qualitative analysis on the collected interview dialogues and summarize the salient patterns of how interviewers elicit knowledge from the experts. The corpus serves the purpose to facilitate future research on the knowledge elicitation mechanism in interview dialogues.

Construction of Hierarchical Structured Knowledge-based Recommendation Dialogue Dataset and Dialogue System
Takashi Kodama | Ribeka Tanaka | Sadao Kurohashi
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

We work on a recommendation dialogue system to help a user understand the appealing points of some target (e.g., a movie). In such dialogues, the recommendation system needs to utilize structured external knowledge to make informative and detailed recommendations. However, there is no dialogue dataset with structured external knowledge designed to make detailed recommendations for the target. Therefore, we construct a dialogue dataset, Japanese Movie Recommendation Dialogue (JMRD), in which the recommender recommends one movie in a long dialogue (23 turns on average). The external knowledge used in this dataset is hierarchically structured, including title, casts, reviews, and plots. Every recommender’s utterance is associated with the external knowledge related to the utterance. We then create a movie recommendation dialogue system that considers the structure of the external knowledge and the history of the knowledge used. Experimental results show that the proposed model is superior in knowledge selection to the baseline models.

2020

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation differs greatly across countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find the information they are interested in efficiently by putting articles into different categories.

This paper concerns the problem of realizing consistent personalities in neural conversational modeling by using user generated question-answer pairs as training data. Using the framework of role play-based question answering, we collected single-turn question-answer pairs for particular characters from online users. Meta information was also collected such as emotion and intimacy related to question-answer pairs. We verified the quality of the collected data and, by subjective evaluation, we also verified their usefulness in training neural conversational models for generating utterances reflecting the meta information, especially emotion.