Yuji Matsumoto

Also published as: Yūji Matsumoto

2026

Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition
Shanshan Liu | Noriki Nishida | Fei Cheng | Narumi Tokunaga | Rumana Ferdous Munne | Yuki Yamagata | Kouji Kozaki | Takehito Utsuro | Yuji Matsumoto
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.

2025

pdf bib abs

Post Persona Alignment for Multi-Session Dialogue Generation
Yi-Pei Chen | Noriki Nishida | Hideki Nakayama | Yuji Matsumoto
Findings of the Association for Computational Linguistics: EMNLP 2025

Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker’s persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.

pdf bib abs

Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models
An Dao | Hiroki Teranishi | Yuji Matsumoto | Florian Boudin | Akiko Aizawa
Proceedings of the 24th Workshop on Biomedical Language Processing

Named Entity Recognition (NER) is crucial for extracting domain-specific entities from text, particularly in biomedical and chemical fields. Developing high-quality NER models in specialized domains is challenging due to the limited availability of annotated data, with manual annotation being a key method of data construction. However, manual annotation is time-consuming and requires domain expertise, making it difficult in specialized domains. Traditional data augmentation (DA) techniques also rely on annotated data to some extent, further limiting their effectiveness. In this paper, we propose a novel approach to synthetic data generation for NER using large language models (LLMs) to generate sentences based solely on a set of example entities. This method simplifies the augmentation process and is effective even with a limited set of entities.We evaluate our approach using BERT-based models on the BC4CHEMD, BC5CDR, and TDMSci datasets, demonstrating that synthetic data significantly improves model performance and robustness, particularly in low-resource settings. This work provides a scalable solution for enhancing NER in specialized domains, overcoming the limitations of manual annotation and traditional augmentation methods.

pdf bib abs

Entity Profile Generation and Reasoning with LLMs for Entity Alignment
Rumana Ferdous Munne | Md Mostafizur Rahman | Yuji Matsumoto
Findings of the Association for Computational Linguistics: EMNLP 2025

Entity alignment (EA) involves identifying and linking equivalent entities across different knowledge graphs (KGs). While knowledge graphs provide structured information about real-world entities, only a small fraction of these entities are aligned. The entity alignment process is challenging due to heterogeneity in KGs, such as differences in structure, terminology, and attribute details. Traditional EA methods use multi-aspect entity embeddings to align entities. Although these methods perform well in certain scenarios, their effective- ness is often constrained by sparse or incomplete data in knowledge graphs and the limitations of embedding techniques. We propose ProLEA ( Profile Generation and Reasoning with LLMs for Entity Alignment) an entity alignment method that combines large language models (LLMs) with entity embed- dings. LLMs generate contextual profiles for entities based on their properties. Candidate entities identified by entity embedding techniques are then re-evaluated by the LLMs, using its background knowledge and the generated profile. A thresholding mechanism is introduced to resolve conflicts between LLMs predictions and embedding-based alignments. This method enhances alignment accuracy, robustness, and explainability, particularly for complex, het- erogeneous knowledge graphs. Furthermore, ProLEA is a generalized framework. Its profile generation and LLM-enhanced entity align- ment components can improve the performance of existing entity alignment models.

pdf bib abs

MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition
Shanshan Liu | Noriki Nishida | Rumana Ferdous Munne | Narumi Tokunaga | Yuki Yamagata | Kouji Kozaki | Yuji Matsumoto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language model (LLM)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.

pdf bib abs

PolyMinder: A Support System for Entity Annotation and Relation Extraction in Polymer Science Documents
Truong Dinh Do | An Hoang Trieu | Van-Thuy Phi | Minh Le Nguyen | Yuji Matsumoto
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

The growing volume of scientific literature in polymer science presents a significant challenge for researchers attempting to extract and annotate domain-specific entities, such as polymer names, material properties, and related information. Manual annotation of these documents is both time-consuming and prone to error due to the complexity of scientific language. To address this, we introduce PolyMinder, an automated support system designed to assist polymer scientists in extracting and annotating polymer-related entities and their relationships from scientific documents. The system utilizes recent advanced Named Entity Recognition (NER) and Relation Extraction (RE) models tailored to the polymer domain. PolyMinder streamlines the annotation process by providing a web-based interface where users can visualize, verify, and refine the extracted information before finalizing the annotations. The system’s source code is made publicly available to facilitate further research and development in this field. Our system can be accessed through the following URL: https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/polyminder

pdf bib abs

Corporate history in corporate annual reports includes events related to organizational changes, which can provide useful cues for a comprehensive understanding of corporate actions.However, extracting organizational changes requires identifying differences in companies before and after an event, raising concerns about whether existing information extraction systems can accurately capture the relations.This work introduces JaCorpTrack, a novel event extraction task designed to identify events related to organizational changes.JaCorpTrack defines five event types related to organizational changes and is designed to identify the company names before and after each event, as well as the corresponding date.Experimental results indicate that large language models (LLMs) exhibit notable disparities in performance across event types.Our analysis reveals that these systems face challenges in identifying company names before and after events, and in interpreting event types expressed under ambiguous terminology.We will publicly release our dataset and experimental code at https://github.com/naist-nlp/JaCorpTrack

pdf bib abs

Automatic biomedical annotation is essential for advancing medical research, diagnosis, and treatment. However, it presents significant challenges, especially when entities are not explicitly mentioned in the text, leading to difficulties in extraction of relevant information. These challenges are intensified by unclear terminology, implicit background knowledge, and the lack of labeled training data. Annotating with a specific ontology adds another layer of complexity, as it requires aligning text with a predefined set of concepts and relationships. Manual annotation is time-consuming and expensive, highlighting the need for automated systems to handle large volumes of biomedical data efficiently. In this paper, we propose an entailment-based zero-shot text classification approach to annotate biomedical text passages using the Homeostasis Imbalance Process (HOIP) ontology. Our method reformulates the annotation task as a multi-class, multi-label classification problem and uses natural language inference to classify text into related HOIP processes. Experimental results show promising performance, especially when processes are not explicitly mentioned, highlighting the effectiveness of our approach for ontological annotation of biomedical literature.

pdf bib abs

A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature
Van-Thuy Phi | Dinh-Truong Do | Hoang-An Trieu | Yuji Matsumoto
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications

Extracting structured information from tables in scientific literature is a critical yet challenging task for building domain-specific knowledge bases. This paper addresses extraction of 5-ary polymer property tuples: (POLYMER, PROP_NAME, PROP_VALUE, CONDITION, CHAR_METHOD). We introduce and systematically compare two distinct methodologies: (1) a novel two-stage Hybrid Pipeline that first utilizes Large Language Models (LLMs) for table-to-text conversion, which is then processed by specialized text-based extraction models; and (2) an end-to-end Direct LLM Extraction approach. To evaluate these methods, we employ a systematic, domain-aligned evaluation setup based on the expert-curated PoLyInfo database. Our results demonstrate the clear superiority of the hybrid pipeline. When using Claude Sonnet 4.5 for the linearization stage, the pipeline achieves a score of 67.92% F1@PoLyInfo, significantly outperforming the best direct LLM extraction approach (Claude Sonnet 4.5 at 56.66%). This work establishes the effectiveness of a hybrid architecture that combines the generative strengths of LLMs with the precision of specialized supervised models for structured data extraction.

pdf bib abs

A Unified Framework for N-ary Property Information Extraction in Materials Science
Van-Thuy Phi | Yuji Matsumoto
Findings of the Association for Computational Linguistics: EMNLP 2025

This paper presents a unified framework for extracting n-ary property information from materials science literature, addressing the critical challenge of capturing complex relationships that often span multiple sentences. We introduce three complementary approaches: RE-Composition, which transforms binary relations into n-ary structures; Direct EAE, which models polymer properties as events with multiple arguments; and LLM-Guided Assembly, which leverages high-confidence entity and relation outputs to guide structured extraction. Our framework is built upon two novel resources: MatSciNERE, a comprehensive corpus for materials science entities and relations, and PolyEE, a specialized corpus for polymer property events. Through strategic synthetic data generation for both NER and EAE tasks, we achieve significant performance improvements (up to 5.34 F1 points). Experiments demonstrate that our combined approaches outperform any single method, with the LLM-guided approach achieving the highest F1 score (71.53%). The framework enables more comprehensive knowledge extraction from scientific literature, supporting materials discovery and database curation applications. We plan to release our resources and trained models to the research community.

2024

pdf bib abs

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

pdf bib abs

Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
Yi-Pei Chen | Noriki Nishida | Hideki Nakayama | Yuji Matsumoto
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Enhancing user engagement through personalization in conversational agents has gained significance, especially with the advent of large language models that generate fluent responses. Personalized dialogue generation, however, is multifaceted and varies in its definition – ranging from instilling a persona in the agent to capturing users’ explicit and implicit cues. This paper seeks to systemically survey the recent landscape of personalized dialogue generation, including the datasets employed, methodologies developed, and evaluation metrics applied. Covering 22 datasets, we highlight benchmark datasets and newer ones enriched with additional features. We further analyze 17 seminal works from top conferences between 2021-2023 and identify five distinct types of problems. We also shed light on recent progress by LLMs in personalized dialogue generation. Our evaluation section offers a comprehensive summary of assessment facets and metrics utilized in these works. In conclusion, we discuss prevailing challenges and envision prospect directions for future research in personalized dialogue generation.

pdf bib abs

Biomedical information extraction is crucial for advancing research, enhancing healthcare, and discovering treatments by efficiently analyzing extensive data. Given the extensive amount of biomedical data available, automated information extraction methods are necessary due to manual extraction’s labor-intensive, expertise-dependent, and costly nature. In this paper, we propose a novel two-stage system for information extraction where we annotate biomedical articles based on a specific ontology (HOIP). The major challenge is annotating relation between biomedical processes often not explicitly mentioned in text articles. Here, we first predict the candidate processes and then determine the relationships between these processes. The experimental results show promising outcomes in mention-agnostic process identification using Large Language Models (LLMs). In relation classification, BERT-based supervised models still outperform LLMs significantly. The end-to-end evaluation results suggest the difficulty of this task and room for improvement in both process identification and relation classification.

pdf bib abs

PolyNERE: A Novel Ontology and Corpus for Named Entity Recognition and Relation Extraction in Polymer Science Domain
Van-Thuy Phi | Hiroki Teranishi | Yuji Matsumoto | Hiroyuki Oka | Masashi Ishii
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Polymers are widely used in diverse fields, and the demand for efficient methods to extract and organize information about them is increasing. An automated approach that utilizes machine learning can accurately extract relevant information from scientific papers, providing a promising solution for automating information extraction using annotated training data. In this paper, we introduce a polymer-relevant ontology featuring crucial entities and relations to enhance information extraction in the polymer science field. Our ontology is customizable to adapt to specific research needs. We present PolyNERE, a high-quality named entity recognition (NER) and relation extraction (RE) corpus comprising 750 polymer abstracts annotated using our ontology. Distinctive features of PolyNERE include multiple entity types, relation categories, support for various NER settings, and the ability to assert entities and relations at different levels. PolyNERE also facilitates reasoning in the RE task through supporting evidence. While our experiments with recent advanced methods achieved promising results, challenges persist in adapting NER and RE from abstracts to full-text paragraphs. This emphasizes the need for robust information extraction systems in the polymer domain, making our corpus a valuable benchmark for future developments.

Yuji Matsumoto

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

Co-authors

Venues