TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly


Abstract
Tasks, Datasets and Evaluation Metrics are important concepts for understanding experimental scientific papers. However, previous work on information extraction for scientific literature focuses mainly on abstracts and does not treat datasets as a separate type of entity (Zadeh and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities in 2,000 sentences extracted from NLP papers. We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is made publicly available to the community to foster research on scientific publication summarization (Erera et al., 2019) and knowledge discovery.
Anthology ID:
2021.eacl-main.59
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
707–714
URL:
https://aclanthology.org/2021.eacl-main.59
DOI:
10.18653/v1/2021.eacl-main.59
Cite (ACL):
Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, and Debasis Ganguly. 2021. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 707–714, Online. Association for Computational Linguistics.
Cite (Informal):
TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics (Hou et al., EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.59.pdf
Code
IBM/science-result-extractor
Data
SciERC, SciREX