Hiroki Teranishi

2025

Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models
An Dao | Hiroki Teranishi | Yuji Matsumoto | Florian Boudin | Akiko Aizawa
Proceedings of the 24th Workshop on Biomedical Language Processing

Named Entity Recognition (NER) is crucial for extracting domain-specific entities from text, particularly in biomedical and chemical fields. Developing high-quality NER models in specialized domains is challenging due to the limited availability of annotated data, with manual annotation being a key method of data construction. However, manual annotation is time-consuming and requires domain expertise, making it difficult in specialized domains. Traditional data augmentation (DA) techniques also rely on annotated data to some extent, further limiting their effectiveness. In this paper, we propose a novel approach to synthetic data generation for NER using large language models (LLMs) to generate sentences based solely on a set of example entities. This method simplifies the augmentation process and is effective even with a limited set of entities. We evaluate our approach using BERT-based models on the BC4CHEMD, BC5CDR, and TDMSci datasets, demonstrating that synthetic data significantly improves model performance and robustness, particularly in low-resource settings. This work provides a scalable solution for enhancing NER in specialized domains, overcoming the limitations of manual annotation and traditional augmentation methods.

Human traveling trajectories play a central role in characterizing each travelogue, and automatic trajectory extraction from travelogues is highly desired for tourism services, such as travel planning and recommendation. This work addresses the extraction of human traveling trajectories from travelogues. Previous work treated each trajectory as a sequence of visited locations, although locations with different granularity levels, e.g., “Kyoto City” and “Kyoto Station,” should not be lined up in a sequence. In this work, we propose to represent the trajectory as a graph that can capture the hierarchy as well as the visiting order, and construct a benchmark dataset for the trajectory extraction. The experiments using this dataset show that even naive baseline systems can accurately predict visited locations and the visiting order between them, while it is more challenging to predict the hierarchical relations.

Corporate history in corporate annual reports includes events related to organizational changes, which can provide useful cues for a comprehensive understanding of corporate actions. However, extracting organizational changes requires identifying differences in companies before and after an event, raising concerns about whether existing information extraction systems can accurately capture the relations. This work introduces JaCorpTrack, a novel event extraction task designed to identify events related to organizational changes. JaCorpTrack defines five event types related to organizational changes and is designed to identify the company names before and after each event, as well as the corresponding date. Experimental results indicate that large language models (LLMs) exhibit notable disparities in performance across event types. Our analysis reveals that these systems face challenges in identifying company names before and after events, and in interpreting event types expressed under ambiguous terminology. We will publicly release our dataset and experimental code at https://github.com/naist-nlp/JaCorpTrack

A Text Embedding Model with Contrastive Example Mining for Point-of-Interest Geocoding
Hibiki Nakatani | Hiroki Teranishi | Shohei Higashiyama | Yuya Sawada | Hiroki Ouchi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Geocoding is a fundamental technique that links location mentions to their geographic positions, which is important for understanding texts in terms of where the described events occurred. Unlike most geocoding studies that targeted coarse-grained locations, we focus on geocoding at a fine-grained point-of-interest (POI) level. To address the challenge of finding appropriate geo-database entries from among many candidates with similar POI names, we develop a text embedding-based geocoding model and investigate (1) entry encoding representations and (2) hard negative mining approaches suitable for enhancing the model’s disambiguation ability. Our experiments show that the second factor significantly impacts the geocoding accuracy of the model.

2024

Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

PolyNERE: A Novel Ontology and Corpus for Named Entity Recognition and Relation Extraction in Polymer Science Domain
Van-Thuy Phi | Hiroki Teranishi | Yuji Matsumoto | Hiroyuki Oka | Masashi Ishii
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Polymers are widely used in diverse fields, and the demand for efficient methods to extract and organize information about them is increasing. An automated approach that utilizes machine learning can accurately extract relevant information from scientific papers, providing a promising solution for automating information extraction using annotated training data. In this paper, we introduce a polymer-relevant ontology featuring crucial entities and relations to enhance information extraction in the polymer science field. Our ontology is customizable to adapt to specific research needs. We present PolyNERE, a high-quality named entity recognition (NER) and relation extraction (RE) corpus comprising 750 polymer abstracts annotated using our ontology. Distinctive features of PolyNERE include multiple entity types, relation categories, support for various NER settings, and the ability to assert entities and relations at different levels. PolyNERE also facilitates reasoning in the RE task through supporting evidence. While our experiments with recent advanced methods achieved promising results, challenges persist in adapting NER and RE from abstracts to full-text paragraphs. This emphasizes the need for robust information extraction systems in the polymer domain, making our corpus a valuable benchmark for future developments.

Synthetic Context with LLM for Entity Linking from Scientific Tables
Yuji Oshima | Hiroyuki Shindo | Hiroki Teranishi | Hiroki Ouchi | Taro Watanabe
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Tables in scientific papers contain crucial information, such as experimental results. Entity Linking (EL) is a promising technology that analyses tables and associates them with a knowledge base. EL for table cells requires identifying the referent concept of each cell while understanding the context relevant to each cell in the paper. However, extracting the relevant context from the paper is challenging because the relevant parts are scattered in the main text and captions. This study defines a rule-based method for extracting broad context from the main text, including table captions and sentences that mention the table. Furthermore, we propose synthetic context as a more refined context generated by large language models (LLMs). In a synthetic context, contexts from the entire paper are refined by summarizing, injecting supplemental knowledge, and clarifying the referent concept. We observe this approach improves accuracy for EL by more than 10 points on the S2abEL dataset, and our qualitative analysis suggests potential future works.

2022

Coordination Generation via Synchronized Text-Infilling
Hiroki Teranishi | Yuji Matsumoto
Proceedings of the 29th International Conference on Computational Linguistics

Generating synthetic data for supervised learning from large-scale pre-trained language models has enhanced performances across several NLP tasks, especially in low-resource scenarios. In particular, many studies of data augmentation employ masked language models to replace words with other words in a sentence. However, most of them are evaluated on sentence classification tasks and cannot immediately be applied to tasks related to the sentence structure. In this paper, we propose a simple yet effective approach to generating sentences with a coordinate structure in which the boundaries of its conjuncts are explicitly specified. For a given span in a sentence, our method embeds a mask with a coordinating conjunction in two ways (“X and [mask]”, “[mask] and X”) and forces masked language models to fill the two blanks with an identical text. To achieve this, we introduce decoding methods for BERT and T5 models with the constraint that predictions for different masks are synchronized. Furthermore, we develop a training framework that effectively selects synthetic examples for the supervised coordination disambiguation task. We demonstrate that our method produces promising coordination instances that provide gains for the task in low-resource settings.

2020

We propose a simple method for nominal coordination boundary identification. As the main strength of our method, it can identify the coordination boundaries without training on labeled data, and can be applied even if coordination structure annotations are not available. Our system employs pre-trained word embeddings to measure the similarities of words and detects the span of coordination, assuming that conjuncts share syntactic and semantic similarities. We demonstrate that our method yields good results in identifying coordinated noun phrases in the GENIA corpus and is comparable to a recent supervised method for the case when the coordinator conjoins simple noun phrases.

2019

Decomposed Local Models for Coordinate Structure Parsing
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a simple and accurate model for coordination boundary identification. Our model decomposes the task into three sub-tasks during training: finding a coordinator, identifying inside boundaries of a pair of conjuncts, and selecting outside boundaries of it. For inference, we make use of probabilities of coordinators and conjuncts in the CKY parsing to find the optimal combination of coordinate structures. Experimental results demonstrate that our model achieves state-of-the-art results, ensuring that the global structure of coordinations is consistent.

2017

Coordination Boundary Identification with Similarity and Replaceability
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose a neural network model for coordination boundary detection. Our method relies on the two common properties - similarity and replaceability in conjuncts - in order to detect both similar pairs of conjuncts and dissimilar pairs of conjuncts. The model improves identification of clause-level coordination using bidirectional RNNs incorporating two properties as features. We show that our model outperforms the existing state-of-the-art methods on the coordination annotated Penn Treebank and Genia corpus without any syntactic information from parsers.
