WEXEA: Wikipedia EXhaustive Entity Annotation

Michael Strobl, Amine Trabelsi, Osmar Zaiane


Abstract
Building predictive models for information extraction from text, such as named entity recognition or the extraction of semantic relationships between named entities, requires a large corpus of annotated text. Wikipedia is often used as a corpus for these tasks, where each annotation is a named entity mention linked by a hyperlink to its article. However, Wikipedia editors are expected to link mentions only where doing so helps the reader understand the content, and are discouraged from adding links that do not aid understanding of an article. Therefore, many mentions of popular entities (such as countries or popular events in history), of previously linked articles, and of the article’s own entity are not linked. In this paper, we discuss WEXEA, a Wikipedia EXhaustive Entity Annotation system, which creates a text corpus based on Wikipedia with exhaustive annotations of entity mentions, i.e., it links all mentions of entities to their corresponding articles. This opens up a large potential for additional annotations that can be used for downstream NLP tasks, such as Relation Extraction. We show that our annotations are useful for creating distantly supervised datasets for this task. Furthermore, we publish all code necessary to derive a corpus from a raw Wikipedia dump, so that it can be reproduced by everyone.
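To make the gap described in the abstract concrete, the following is a minimal sketch of the simplest case: re-linking later mentions of entities that were already linked once, plus mentions of the article's own title. It is not the WEXEA pipeline; the function names `propagate_links` and `annotate`, the exact-string alias matching, and the example text are all assumptions made for illustration.

```python
import re

# Wikipedia-style link markup: [[Target]] or [[Target|surface text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def annotate(segment, aliases):
    """Wrap every known alias in an unlinked text segment with link markup."""
    if not segment or not aliases:
        return segment
    # Try longer aliases first so "United States" wins over "United".
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b")

    def repl(match):
        surface = match.group(1)
        target = aliases[surface]
        return f"[[{target}]]" if surface == target else f"[[{target}|{surface}]]"

    return pattern.sub(repl, segment)


def propagate_links(title, text):
    """Hypothetical helper: link unlinked mentions of the article's own
    entity and of entities linked earlier in the same text."""
    aliases = {title: title}  # alias -> article title, seeded with the article itself
    out, pos = [], 0
    for m in LINK.finditer(text):
        # Annotate the plain text before this link using aliases seen so far.
        out.append(annotate(text[pos:m.start()], aliases))
        out.append(m.group(0))  # keep the editor's original link untouched
        target, surface = m.group(1), m.group(2) or m.group(1)
        aliases[target] = target
        aliases[surface] = target
        pos = m.end()
    out.append(annotate(text[pos:], aliases))
    return "".join(out)


if __name__ == "__main__":
    sample = ("Alberta is a province of [[Canada]]. "
              "Canada borders the [[United States|US]]. "
              "The US is south of Alberta.")
    print(propagate_links("Alberta", sample))
```

Exact string matching of this kind misses coreferent mentions with different surface forms and cannot disambiguate aliases shared by several articles, so it only illustrates the kind of annotation that ordinary Wikipedia markup leaves out.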
Anthology ID:
2020.lrec-1.240
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1951–1958
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.240
Cite (ACL):
Michael Strobl, Amine Trabelsi, and Osmar Zaiane. 2020. WEXEA: Wikipedia EXhaustive Entity Annotation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1951–1958, Marseille, France. European Language Resources Association.
Cite (Informal):
WEXEA: Wikipedia EXhaustive Entity Annotation (Strobl et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.240.pdf