A Dataset for Open Event Extraction in English

Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon


Abstract
This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts with no supervision and grouping together the entities that correspond to the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used to build the corpus and evaluate some existing systems on this new data.
Anthology ID:
L16-1307
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1939–1943
URL:
https://aclanthology.org/L16-1307
Cite (ACL):
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, and Romaric Besançon. 2016. A Dataset for Open Event Extraction in English. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1939–1943, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
A Dataset for Open Event Extraction in English (Nguyen et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1307.pdf