SciREX: A Challenge Dataset for Document-Level Information Extraction

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy


Abstract
Extracting information from full documents is an important problem in many domains, but most previous work focus on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level since it requires an understanding of the whole document to annotate entities and their document-level relationships that usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX .
Anthology ID:
2020.acl-main.670
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
7506–7516
Language:
URL:
https://aclanthology.org/2020.acl-main.670
DOI:
10.18653/v1/2020.acl-main.670
Bibkey:
Cite (ACL):
Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. 2020. SciREX: A Challenge Dataset for Document-Level Information Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7506–7516, Online. Association for Computational Linguistics.
Cite (Informal):
SciREX: A Challenge Dataset for Document-Level Information Extraction (Jain et al., ACL 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.acl-main.670.pdf
Dataset:
 2020.acl-main.670.Dataset.tgz
Video:
 http://slideslive.com/38929290
Code
 allenai/SciREX
Data
SciREX