2024
Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation
Shohei Higashiyama | Hiroki Ouchi | Hiroki Teranishi | Hiroyuki Otomo | Yusuke Ide | Aitaro Yamamoto | Hiroyuki Shindo | Yuki Matsuda | Shoko Wakamiya | Naoya Inoue | Ikuya Yamada | Taro Watanabe
Findings of the Association for Computational Linguistics: EACL 2024
Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.
PolyNERE: A Novel Ontology and Corpus for Named Entity Recognition and Relation Extraction in Polymer Science Domain
Van-Thuy Phi | Hiroki Teranishi | Yuji Matsumoto | Hiroyuki Oka | Masashi Ishii
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Polymers are widely used in diverse fields, and the demand for efficient methods to extract and organize information about them is increasing. An automated approach that utilizes machine learning can accurately extract relevant information from scientific papers, providing a promising solution for automating information extraction using annotated training data. In this paper, we introduce a polymer-relevant ontology featuring crucial entities and relations to enhance information extraction in the polymer science field. Our ontology is customizable to adapt to specific research needs. We present PolyNERE, a high-quality named entity recognition (NER) and relation extraction (RE) corpus comprising 750 polymer abstracts annotated using our ontology. Distinctive features of PolyNERE include multiple entity types, relation categories, support for various NER settings, and the ability to assert entities and relations at different levels. PolyNERE also facilitates reasoning in the RE task through supporting evidence. While our experiments with recent advanced methods achieved promising results, challenges persist in adapting NER and RE from abstracts to full-text paragraphs. This emphasizes the need for robust information extraction systems in the polymer domain, making our corpus a valuable benchmark for future developments.
Synthetic Context with LLM for Entity Linking from Scientific Tables
Yuji Oshima | Hiroyuki Shindo | Hiroki Teranishi | Hiroki Ouchi | Taro Watanabe
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Tables in scientific papers contain crucial information, such as experimental results. Entity Linking (EL) is a promising technology that analyses tables and associates them with a knowledge base. EL for table cells requires identifying the referent concept of each cell while understanding the context relevant to each cell in the paper. However, extracting the relevant context from the paper is challenging because the relevant parts are scattered in the main text and captions. This study defines a rule-based method for extracting broad context from the main text, including table captions and sentences that mention the table. Furthermore, we propose synthetic context as a more refined context generated by large language models (LLMs). In a synthetic context, contexts from the entire paper are refined by summarizing, injecting supplemental knowledge, and clarifying the referent concept. We observe that this approach improves accuracy for EL by more than 10 points on the S2abEL dataset, and our qualitative analysis suggests potential future work.
2022
Coordination Generation via Synchronized Text-Infilling
Hiroki Teranishi | Yuji Matsumoto
Proceedings of the 29th International Conference on Computational Linguistics
Generating synthetic data for supervised learning from large-scale pre-trained language models has enhanced performance across several NLP tasks, especially in low-resource scenarios. In particular, many studies of data augmentation employ masked language models to replace words with other words in a sentence. However, most of them are evaluated on sentence classification tasks and cannot immediately be applied to tasks related to the sentence structure. In this paper, we propose a simple yet effective approach to generating sentences with a coordinate structure in which the boundaries of its conjuncts are explicitly specified. For a given span in a sentence, our method embeds a mask with a coordinating conjunction in two ways ("X and [mask]", "[mask] and X") and forces masked language models to fill the two blanks with an identical text. To achieve this, we introduce decoding methods for BERT and T5 models with the constraint that predictions for different masks are synchronized. Furthermore, we develop a training framework that effectively selects synthetic examples for the supervised coordination disambiguation task. We demonstrate that our method produces promising coordination instances that provide gains for the task in low-resource settings.
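As a rough illustration of the two-template construction this abstract describes, the sketch below builds the "X and [mask]" and "[mask] and X" variants for a chosen span, then picks a single filler for both blanks. The function names, the `[mask]` token string, and the sum-of-scores selection rule are assumptions made for illustration; the abstract does not specify the paper's decoding procedure for BERT and T5, and the candidate scores here are placeholder numbers rather than model outputs.

```python
from typing import Dict, List, Tuple

MASK = "[mask]"  # placeholder token string (assumption, not a real tokenizer's mask)


def coordination_templates(
    tokens: List[str], start: int, end: int, conjunction: str = "and"
) -> Tuple[List[str], List[str]]:
    """Build 'X and [mask]' and '[mask] and X' around the span tokens[start:end]."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty range inside the sentence")
    span = tokens[start:end]
    mask_after = tokens[:start] + span + [conjunction, MASK] + tokens[end:]
    mask_before = tokens[:start] + [MASK, conjunction] + span + tokens[end:]
    return mask_after, mask_before


def pick_synchronized(
    scores_after: Dict[str, float], scores_before: Dict[str, float]
) -> str:
    """Choose one filler for both blanks by maximizing the summed score.

    This is a hypothetical stand-in for synchronized decoding: only candidates
    scored in both templates are eligible, so the two blanks get identical text.
    """
    shared = set(scores_after) & set(scores_before)
    if not shared:
        raise ValueError("no candidate is scored in both templates")
    # Sort first so ties are broken deterministically.
    return max(sorted(shared), key=lambda c: scores_after[c] + scores_before[c])
```

For the sentence "I bought apples yesterday" with the span covering "apples", the two templates are "I bought apples and [mask] yesterday" and "I bought [mask] and apples yesterday". Requiring one shared filler penalizes candidates that fit only one of the two positions.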
2020
Coordination Boundary Identification without Labeled Data for Compound Terms Disambiguation
Yuya Sawada | Takashi Wada | Takayoshi Shibahara | Hiroki Teranishi | Shuhei Kondo | Hiroyuki Shindo | Taro Watanabe | Yuji Matsumoto
Proceedings of the 28th International Conference on Computational Linguistics
We propose a simple method for nominal coordination boundary identification. The main strength of our method is that it can identify coordination boundaries without training on labeled data, so it can be applied even when coordination structure annotations are not available. Our system employs pre-trained word embeddings to measure the similarities of words and detects the span of coordination, assuming that conjuncts share syntactic and semantic similarities. We demonstrate that our method yields good results in identifying coordinated noun phrases in the GENIA corpus and is comparable to a recent supervised method for the case when the coordinator conjoins simple noun phrases.
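The idea of detecting conjunct spans through embedding similarity can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' system: the averaging of word vectors, the exhaustive search over span lengths, the `max_len` limit, and all function names are illustrative choices, and the embeddings are expected to be supplied by the caller as a plain dictionary.

```python
import math
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def mean_vector(words: List[str], emb: Dict[str, Sequence[float]]) -> List[float]:
    """Average the embeddings of the given words."""
    vecs = [emb[w] for w in words]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_conjunct_spans(
    tokens: List[str],
    coord_index: int,
    emb: Dict[str, Sequence[float]],
    max_len: int = 3,
) -> Tuple[Span, Span]:
    """Pick the left/right spans adjacent to the coordinator that are most similar."""
    best = None
    for left_len in range(1, max_len + 1):
        for right_len in range(1, max_len + 1):
            left_start = coord_index - left_len
            right_end = coord_index + 1 + right_len
            if left_start < 0 or right_end > len(tokens):
                continue  # candidate span would run off the sentence
            left = tokens[left_start:coord_index]
            right = tokens[coord_index + 1:right_end]
            score = cosine(mean_vector(left, emb), mean_vector(right, emb))
            if best is None or score > best[0]:
                best = (score, (left_start, coord_index), (coord_index + 1, right_end))
    if best is None:
        raise ValueError("no valid conjunct spans around the coordinator")
    return best[1], best[2]
```

No labeled data is involved: the only inputs are the token sequence, the position of the coordinator, and pre-trained vectors, which matches the setting the abstract describes.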
2019
Decomposed Local Models for Coordinate Structure Parsing
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
We propose a simple and accurate model for coordination boundary identification. Our model decomposes the task into three sub-tasks during training: finding a coordinator, identifying the inside boundaries of a pair of conjuncts, and selecting their outside boundaries. For inference, we use the probabilities of coordinators and conjuncts in CKY parsing to find the optimal combination of coordinate structures. Experimental results demonstrate that our model achieves state-of-the-art results while ensuring that the global structure of coordinations is consistent.
2017
Coordination Boundary Identification with Similarity and Replaceability
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
We propose a neural network model for coordination boundary detection. Our method relies on two common properties of conjuncts, similarity and replaceability, in order to detect both similar and dissimilar pairs of conjuncts. The model improves identification of clause-level coordination using bidirectional RNNs that incorporate the two properties as features. We show that our model outperforms the existing state-of-the-art methods on the coordination-annotated Penn Treebank and the GENIA corpus without any syntactic information from parsers.