2020
Evaluation Dataset and Methodology for Extracting Application-Specific Taxonomies from the Wikipedia Knowledge Graph
Georgeta Bordea | Stefano Faralli | Fleur Mougin | Paul Buitelaar | Gayo Diallo
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this work, we address the task of extracting application-specific taxonomies from the category hierarchy of Wikipedia. Previous work on pruning the Wikipedia knowledge graph relied on silver-standard taxonomies, which can only be extracted automatically for a small subset of domains rooted in relatively focused nodes placed at an intermediate level in the knowledge graph. We propose an iterative methodology to extract an application-specific gold-standard dataset from a knowledge graph, and an evaluation framework to comparatively assess the quality of noisy, automatically extracted taxonomies. We employ an existing state-of-the-art algorithm in an iterative manner and propose several sampling strategies to reduce the amount of manual work needed for evaluation. A first gold-standard dataset is released to the research community for this task, along with a companion evaluation framework. This dataset addresses a real-world application from the medical domain, namely the extraction of food-drug and herb-drug interactions.
2019
Query selection methods for automated corpora construction with a use case in food-drug interactions
Georgeta Bordea | Tsanta Randriatsitohaina | Fleur Mougin | Natalia Grabar | Thierry Hamon
Proceedings of the 18th BioNLP Workshop and Shared Task
In this paper, we address the problem of automatically constructing a relevant corpus of scientific articles about food-drug interactions. A growing number of scientific publications describe food-drug interactions, but building a high-coverage corpus that can be used for information extraction is currently not trivial. We investigate several methods for automating the query selection process using an expert-curated corpus of food-drug interactions. Our experiments show that index term features combined with a decision tree classifier are the best approach for this task, and that feature selection approaches, in particular gain ratio, outperform frequency-based methods for query selection.
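As an illustration of the gain ratio criterion named in the abstract above, the following minimal sketch ranks candidate query terms by how well their presence separates relevant from irrelevant documents. It uses the textbook definition (information gain divided by split information). The toy data and the names `entropy`, `gain_ratio`, and `candidate_terms` are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a feature divided by its split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = document is relevant, 0 = not relevant.
labels = [1, 1, 0, 0]
# Hypothetical index terms: 1 = term assigned to the document.
candidate_terms = {
    "term_a": [1, 1, 0, 0],  # perfectly aligned with relevance
    "term_b": [1, 0, 1, 0],  # independent of relevance
    "term_c": [1, 1, 1, 1],  # present everywhere
}

ranking = sorted(candidate_terms,
                 key=lambda t: gain_ratio(candidate_terms[t], labels),
                 reverse=True)
print(ranking)
```

In this sketch, terms that rank highest would be the first candidates for use as queries; a frequency-based baseline would instead favor `term_c`, which appears in every document but does not discriminate between the classes.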
2016
Forecasting Emerging Trends from Scientific Literature
Kartik Asooja | Georgeta Bordea | Gabriela Vulcu | Paul Buitelaar
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Text analysis methods that automatically identify emerging technologies by analyzing scientific publications are gaining attention because of their socio-economic impact. Approaches so far have mainly focused on retrospective analysis, mapping the evolution of scientific topics over time. We propose regression-based approaches to predict future keyword distribution. The prediction is based on historical keyword data, which in our case come from LREC conference proceedings. Considering the insufficient number of data points available from LREC proceedings, we do not employ standard time series forecasting methods. We form a dataset by extracting keywords from the proceedings of previous years and quantifying their yearly relevance using tf-idf scores. This dataset additionally contains ranked lists of related keywords and experts for each keyword.
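The pipeline described in the abstract above (yearly tf-idf relevance per keyword, then a regression that extrapolates to a future year) can be sketched as follows. This is only an illustrative reading: the abstract does not specify the tf-idf variant or the regression model, so the weighting scheme, the ordinary least-squares line, and the names `yearly_relevance` and `linear_forecast` are all assumptions of this example.

```python
import math


def yearly_relevance(papers_by_year, keyword):
    """Score a keyword per year as (term frequency in that year) * idf.

    papers_by_year maps a year to a list of papers, each paper being a
    list of keyword strings. The idf is computed over all papers.
    This particular weighting is an assumption, not the paper's formula.
    """
    all_papers = [p for papers in papers_by_year.values() for p in papers]
    doc_freq = sum(1 for p in all_papers if keyword in p)
    if doc_freq == 0:
        return {year: 0.0 for year in papers_by_year}
    idf = math.log(len(all_papers) / doc_freq)
    scores = {}
    for year, papers in papers_by_year.items():
        total = sum(len(p) for p in papers)
        count = sum(p.count(keyword) for p in papers)
        scores[year] = (count / total) * idf if total else 0.0
    return scores


def linear_forecast(scores, target_year):
    """Fit a least-squares line to (year, score) and evaluate it at target_year."""
    years = sorted(scores)
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(scores[y] for y in years) / n
    var_x = sum((y - mean_x) ** 2 for y in years)
    if var_x == 0:
        return mean_y  # a single data point: no trend can be estimated
    slope = sum((y - mean_x) * (scores[y] - mean_y) for y in years) / var_x
    return mean_y + slope * (target_year - mean_x)


# Hypothetical toy proceedings: two papers per conference year.
papers_by_year = {
    2012: [["parsing", "corpus"], ["corpus", "corpus"]],
    2014: [["parsing", "parsing"], ["parsing", "corpus"]],
}
history = yearly_relevance(papers_by_year, "parsing")
print(history)
print(linear_forecast(history, 2016))
```

With only a handful of yearly data points per keyword, a simple fitted line like this is one plausible alternative to standard time series models, which is the constraint the abstract points to.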
SemEval-2016 Task 13: Taxonomy Extraction Evaluation (TExEval-2)
Georgeta Bordea | Els Lefever | Paul Buitelaar
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
2015
Non-Orthogonal Explicit Semantic Analysis
Nitish Aggarwal | Kartik Asooja | Georgeta Bordea | Paul Buitelaar
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics
SemEval-2015 Task 17: Taxonomy Extraction Evaluation (TExEval)
Georgeta Bordea | Paul Buitelaar | Stefano Faralli | Roberto Navigli
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings
Paul Buitelaar | Georgeta Bordea | Barry Coughlan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we present a comparative analysis of two conference series in the field of Computational Linguistics: LREC and ACL. Conference proceedings were analysed using Saffron by performing term extraction and topical hierarchy construction, with the goal of analysing topic trends and research communities. The system aims to provide insight into a research community and to guide publication and participation strategies, especially for novice researchers.
2012
Semi-Supervised Technical Term Tagging With Minimal User Feedback
Behrang QasemiZadeh | Paul Buitelaar | Tianqi Chen | Georgeta Bordea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper, we address the problem of automatically extracting technical terms from an unannotated corpus. We introduce a technology term tagger that is based on LIBLINEAR Support Vector Machines and employs linguistic features, including part-of-speech tags and dependency structures, together with user feedback to identify technology-related terms. Our experiments show the applicability of our approach, as witnessed by acceptable precision and recall.
Expertise Mining for Enterprise Content Management
Georgeta Bordea | Sabrina Kirrane | Paul Buitelaar | Bianca Pereira
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Enterprise content analysis and platform configuration for enterprise content management are often carried out by external consultants who are not necessarily domain experts. In this paper, we propose a set of methods for automatic content analysis that allow users to gain a high-level view of the enterprise content. A main concern here is the automatic identification of key stakeholders who should ideally be involved in analysis interviews. The proposed approach employs recent advances in term extraction, semantic term grounding, expert profiling and expert finding in an enterprise content management setting. Extracted terms are evaluated by human judges, while term grounding is evaluated against a manually created gold standard for the DBpedia data source.
2010
DERIUNLP: A Context Based Approach to Automatic Keyphrase Extraction
Georgeta Bordea | Paul Buitelaar
Proceedings of the 5th International Workshop on Semantic Evaluation