Evaluation Dataset and Methodology for Extracting Application-Specific Taxonomies from the Wikipedia Knowledge Graph
Georgeta Bordea | Stefano Faralli | Fleur Mougin | Paul Buitelaar | Gayo Diallo
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this work, we address the task of extracting application-specific taxonomies from the category hierarchy of Wikipedia. Previous work on pruning the Wikipedia knowledge graph relied on silver standard taxonomies which can only be automatically extracted for a small subset of domains rooted in relatively focused nodes, placed at an intermediate level in the knowledge graphs. In this work, we propose an iterative methodology to extract an application-specific gold standard dataset from a knowledge graph and an evaluation framework to comparatively assess the quality of noisy automatically extracted taxonomies. We employ an existing state of the art algorithm in an iterative manner and we propose several sampling strategies to reduce the amount of manual work needed for evaluation. A first gold standard dataset is released to the research community for this task along with a companion evaluation framework. This dataset addresses a real-world application from the medical domain, namely the extraction of food-drug and herb-drug interactions.
Query selection methods for automated corpora construction with a use case in food-drug interactions
Georgeta Bordea | Tsanta Randriatsitohaina | Fleur Mougin | Natalia Grabar | Thierry Hamon
Proceedings of the 18th BioNLP Workshop and Shared Task
In this paper, we address the problem of automatically constructing a relevant corpus of scientific articles about food-drug interactions. There is a growing number of scientific publications that describe food-drug interactions but currently building a high-coverage corpus that can be used for information extraction purposes is not trivial. We investigate several methods for automating the query selection process using an expert-curated corpus of food-drug interactions. Our experiments show that index term features along with a decision tree classifier are the best approach for this task and that feature selection approaches and in particular gain ratio outperform frequency-based methods for query selection.
POMELO: Medline corpus with manually annotated food-drug interactions
Thierry Hamon | Vincent Tabanou | Fleur Mougin | Natalia Grabar | Frantz Thiessard
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017
When patients take more than one medication, they may be at risk of drug interactions, which means that a given drug can cause unexpected effects when taken in combination with other drugs. Similar effects may occur when drugs are taken together with some food or beverages. For instance, grapefruit has interactions with several drugs, because its active ingredients inhibit enzymes involved in the drugs metabolism and can then cause an excessive dosage of these drugs. Yet, information on food/drug interactions is poorly researched. The current research is mainly provided by the medical domain and a very tentative work is provided by computer sciences and NLP domains. One factor that motivates the research is related to the availability of the annotated corpora and the reference data. The purpose of our work is to describe the rationale and approach for creation and annotation of scientific corpus with information on food/drug interactions. This corpus contains 639 MEDLINE citations (titles and abstracts), corresponding to 5,752 sentences. It is manually annotated by two experts. The corpus is named POMELO. This annotated corpus will be made available for the research purposes.
- Georgeta Bordea 2
- Natalia Grabar 2
- Thierry Hamon 2
- Tsanta Randriatsitohaina 1
- Vincent Tabanou 1
- show all...